Spatial Accuracy Assessment in Natural Resources and

This file was created by scanning the printed publication. Errors identified by the software have been corrected; however, some errors may remain.

Understanding the Spatial Distribution of Tree Species in Pennsylvania

Rachel Riemann Hershey¹

Abstract.--Current, accessible information on the distribution of tree species would aid in the understanding and management of ecosystems. However, such detailed information on forest composition is only available from ground inventory. Geostatistical techniques are used here to create an interpolated dataset, a 'map' of individual species distribution, from known sample information. In a previous study, we found that indicator kriging and sequential gaussian conditional simulation (sgCS) were promising tools for estimating sugar maple distribution from the USDA National Forest Inventory and Analysis (FIA) data. The techniques provided an estimate of species occurrence and a measure of uncertainty associated with that estimate, while retaining much of the local variability present in the sample data. In this study, these techniques are applied to 9 additional species in Pennsylvania. Four output datasets are available for each species--the probability of species occurrence, an estimate of its relative abundance, and a plus and minus level of uncertainty associated with that estimate. The datasets, used in conjunction with one another, provide the user with considerable flexibility in setting up the balance of errors of omission and commission that best suits the analysis under consideration. Similarities and differences between the species are identified and discussed as to their possible effect on the final estimates. Examples of how the datasets can be used are also presented. Indicator kriging and sgCS, used in conjunction with FIA sample data, provide a relatively straightforward technique to describe species occurrence and relative density across a state.

INTRODUCTION

Data describing forest composition--so desired as a basic data source for many aspects of ecosystem analyses, models, and management--are generally unavailable and/or are stored in fixed forest-type categories. But forest communities are often not well characterized by the discrete categories imposed by forest cover type divisions. Inherent in each forest type category is an entire continuum (usually multi-dimensional) of different species and their relative importance. In a previous study, we compared the geostatistical techniques available to interpolate FIA sample data to create a 'map' of tree species distribution.² The tools of ordinary kriging, multigaussian kriging, indicator kriging, and sequential gaussian conditional simulation (sgCS) were used to estimate the occurrence and distribution of sugar maple in Pennsylvania. After considering the phenomenon being examined, the sample data being used, and the kind(s) of output desired in this study, we found that indicator kriging and sgCS were the best interpolation tools for:

a) providing an estimate of sugar maple occurrence,
b) providing an estimate of sugar maple 'importance' in terms of %ba/acre,
c) providing a measure of uncertainty associated with the two estimates,
d) maintaining local variability,
e) maintaining the characteristics of the original sample data, and
f) handling sample data with highly skewed distributions.

¹ Northeastern Forest Experiment Station, USDA Forest Service, 5 Radnor Corporate Center, Radnor, PA 19087-4585.

a) An estimate of each species' presence or absence was provided by indicator kriging. An indicator transform divides the data into two classes--either above or below a designated cutoff value; in this study the cutoff was 0 %ba/acre, indicating presence or absence. Indicator kriging calculates for each cell an estimate of the probability that it falls above or below the cutoff value. The output dataset thus indicated the probability that sugar maple occurred at each location.

b) The second piece of information desired was an estimate of the relative amount of sugar maple at that location--i.e., whether the species represented a minor, moderate, or major component of the total ba/acre on the plot at that location. Sequential gaussian conditional simulation determines multiple estimates for each cell. These are equally probable, alternative realizations of the data, determined from multiple simulation runs. From this set of estimates, an entire distribution can be built for each cell, representing the range of possible values. A summary statistic such as the mean or median of this distribution can then be chosen and used as the modeled 'estimate' of %ba/acre for that cell.³

c) A level of uncertainty is always associated with any estimate. Knowing how much uncertainty exists will help the user identify whether that uncertainty is acceptable for a specific task and how the data can be used. Correspondingly, knowing how much uncertainty exists will help the producer identify areas in which additional sampling would most improve the estimates. For estimates of species' presence/absence, indicator kriging provides this information in terms of a probability. For estimates of %ba/acre, summary statistics such as standard deviation or inter-quartile range were calculated from the distribution of simulated values to describe the variation associated with the %ba/acre estimate for each cell.

² Hershey, R. Riemann, M.A. Ramirez, and D.A. Drake. Using Geostatistical techniques to estimate the distribution and relative density of individual tree species in Pennsylvania. Unpublished report on file at USDA Forest Service, Northeastern Forest Experiment Station, Forest Inventory and Analysis Unit, Radnor, PA.

³ The methods are described in Rossi et al. 1993 and Isaaks and Srivastava 1989; the analysis was performed using GSLIB routines (Deutsch and Journel 1992) with some additional routines written by R.E. Rossi.

d) Tree species in Pennsylvania exhibit a high level of local variation as a result of natural environmental factors and land use histories. At the intensity of sampling present in the FIA sample data, much of this local variation cannot be modeled and effectively predicted in the interpolation process, but instead appears as variation that is unexplained by neighboring plots. However, such local variability is an important characteristic of the distribution of a species. Thus, we did not want this local variability to become hidden behind a regional average of the resource, but to remain as apparent and accessible to the user as possible in the final estimated dataset(s). Sequential gaussian conditional simulation was the most effective of the interpolation methods at maintaining local variation.

e) One feature of a well-designed sampling scheme is that it is sensitive to and can report, with an acceptable level of error, the characteristics of the phenomena of interest. In this ideal situation, the characteristics of the sample data represent reasonably well the characteristics of the phenomenon itself. Every estimation technique honors and maintains different aspects of the original data. The specific goals of the interpolation task at hand will determine the priorities, but in general the more characteristics of the sample data that are preserved in the estimated dataset, the more desirable the dataset. Sequential gaussian conditional simulation again did the best job of maintaining both the univariate and bivariate characteristics of the sample data.

f) As is true with many plant and animal populations, tree species have population distributions that are distinctly skewed toward younger individuals, with more small trees than large mature ones. In addition, Pennsylvania, like most of the northeastern states, contains primarily mixed forests. Individual species rarely occur in pure stands. The 10 species examined in both studies included 8 of the most common by volume in Pennsylvania, and yet more than 50% of the time, when a species occurred on a plot it occurred as only a minor component (here defined as making up less than 20% of the total ba/acre on that plot). Both factors are combined in the %ba/acre 'relative importance' value, resulting in a highly skewed frequency distribution. Such extreme characteristics in the sample data can cause difficulties and biases when used with some of the interpolation methods that depend on assumptions about the normality of the distribution of the sample data (Isaaks and Srivastava 1989). One particular advantage of indicator kriging is that it makes no assumptions about the distribution of the data. Sequential gaussian conditional simulation, on the other hand, does assume that the data are normally distributed and stationary, and must be used more carefully. The sgCS routine used here, from Deutsch and Journel (1992), performs a one-to-one, invertible normal-score transform on the data before running the simulation. In addition, however, the data also should be checked for binormality and a decision made as to whether to assume multivariate normality before the results of conditional simulation are accepted (Rossi et al. 1993).
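The two transforms mentioned in items a) and f) can be sketched in a few lines. This is a minimal illustration, not the GSLIB routines used in the study: the function names, the sample values, the plotting position `(rank + 0.5) / n`, and the tie-breaking by input order are all assumptions made for the example.

```python
from statistics import NormalDist


def indicator_transform(values, cutoff=0.0):
    """Return 1 where the value exceeds the cutoff (species present), else 0."""
    return [1 if v > cutoff else 0 for v in values]


def normal_score_transform(values):
    """Rank-based normal scores.

    Ties are broken by input order, which is an assumption of this sketch;
    heavily tied data (many zero plots) need more careful handling.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])  # stable sort
    std_normal = NormalDist()
    scores = [0.0] * n
    for rank, i in enumerate(order):
        # (rank + 0.5) / n keeps the probability strictly inside (0, 1).
        scores[i] = std_normal.inv_cdf((rank + 0.5) / n)
    return scores


# Made-up %ba/acre values for one species on eight plots.
pct_ba = [0.0, 0.0, 3.5, 0.0, 18.0, 62.0, 0.0, 7.2]
present = indicator_transform(pct_ba)
scores = normal_score_transform(pct_ba)
```

The indicator values feed the indicator variogram and kriging, while the normal scores stand in for the skewed raw values in the simulation and are back-transformed afterwards.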

The data exploration techniques used also proved invaluable to understanding the spatial characteristics of the species/variable being examined. Techniques included univariate analysis, variograms and other spatial dependence analyses, and calculating local statistics. The resulting information was critical not only in determining what interpolation methods were most suitable and for checking the sample data for errors, but also for understanding the characteristics of the sample data and thus the phenomena being investigated. Geostatistical techniques offered ways to explore, organize, and summarize spatial patterns in the data that can provide clues to the variation and spatial behavior of the individual species under investigation.

Applying the techniques to 10 species in Pennsylvania

Because of the promising results from using geostatistical techniques for estimating the distribution of sugar maple from FIA data in the previous study, the same geostatistical methods were applied to nine additional species: red oak, white oak, chestnut oak, black oak, hemlock, red maple, beech, white pine, and yellow birch. This list includes 8 of the top 10 most abundant species in Pennsylvania by volume, and two species (yellow birch and white pine) that are much less common (Alerich 1993).

Tree species distribution is affected by many factors, including both environmental conditions and direct human influence through harvesting and other land use histories. As a result of being differentially affected by all of these factors, each species will exhibit different patterns and scales of spatial distribution. Some of these factors occur at scales much smaller than the sampling intensity of the FIA data, and some occur over larger areas, representing broadscale variation in the species distribution. In the previous study, it was found that a substantial amount of variation in sugar maple distribution was resolved at the sampling scale used for the FIA plots. This spatial dependence could, therefore, be modeled and used to support estimates of species occurrence and relative 'importance' (%ba/acre). The goal of this study is to examine to what extent this is true for the other species. More specifically, the objectives of this study are to determine: a) whether spatial dependence is exhibited at this intensity/scale of sampling, b) the resulting spatial distribution for each species, and how that compares to our current understanding, c) how the species differ from one another in terms of spatial dependence and distribution, and how that affects our ability to estimate them, and d) how to use the resulting estimated datasets.

DATA

The sample data were collected by the Northeastern Forest Experiment Station's Forest Inventory and Analysis (FIA) unit. Basal area--the summed cross-sectional area at breast height--is calculated for all live trees 1.0 inches DBH or larger on the plot (Hansen et al. 1992). The data describe individual tree species as basal area (ba) per acre, expressed as a proportion of the total basal area (%ba/acre). The data were accessed from individual tree records in the USFS Eastwide tree-level database and summarized as %ba/acre for each species by plot. In Pennsylvania, there were a total of 5,100 plots. Nonforested plots and those with total ba/acre equal to zero (due to missing data) were removed, leaving 2,905 plots.
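The plot-level summary described above can be illustrated with a short sketch. The record layout (plot id, species, DBH in inches, trees per acre) is hypothetical and is not the Eastwide database schema. The only fixed ingredient is the geometric conversion from a diameter in inches to a cross-sectional area in square feet, pi / (4 x 144), or about 0.005454.

```python
import math
from collections import defaultdict

# Converts DBH^2 (square inches) to cross-sectional area in square feet.
SQFT_PER_DBH_SQ = math.pi / (4 * 144)  # about 0.005454


def percent_ba_by_species(tree_records, min_dbh=1.0):
    """Summarize tree records as %ba/acre per (plot, species).

    tree_records: iterable of (plot_id, species, dbh_inches, trees_per_acre);
    this layout is a hypothetical stand-in for the real tree-level records.
    """
    plot_total = defaultdict(float)
    species_ba = defaultdict(float)
    for plot, species, dbh, tpa in tree_records:
        if dbh < min_dbh:
            continue  # only trees 1.0 inches DBH or larger are counted
        ba = SQFT_PER_DBH_SQ * dbh ** 2 * tpa
        plot_total[plot] += ba
        species_ba[(plot, species)] += ba
    return {
        key: 100.0 * ba / plot_total[key[0]]
        for key, ba in species_ba.items()
        if plot_total[key[0]] > 0  # plots with zero total ba/acre are dropped
    }


# Made-up records for a single plot.
records = [
    ("P1", "sugar maple", 10.0, 6.0),
    ("P1", "beech", 10.0, 2.0),
    ("P1", "red oak", 20.0, 1.0),
    ("P1", "beech", 0.5, 40.0),  # below the 1.0-inch limit, ignored
]
result = percent_ba_by_species(records)
```

In this made-up plot, sugar maple accounts for half of the basal area, red oak for a third, and beech for the remaining sixth.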

METHODS

Each species was examined entirely independently. As in the previous study, the data for each species were organized, summarized, and explored using univariate statistics, measures of spatial dependence (variogram, covariance, and correlogram), and spatial distribution of local statistics across the state. All species were similar in many of their basic characteristics to each other and to the previously investigated species, sugar maple. Each exhibited an extremely skewed distribution, with more than 50% of the plots containing less than 1 %ba/acre for every species except red maple and red oak.

A variogram was calculated for both the raw sample data and for a one-to-one, invertible normal-score transform of the sample data, using a lag distance of 500 m and no directional component (anisotropy). In every instance, the variogram of the normal-scored data exhibited considerably more spatial dependence and structure than that of the raw data (Figure 1), revealing spatial characteristics that were hidden by the strong univariate characteristics of the data. As sgCS uses normal-scored data, it was the model fitted to the normal-scored variogram that was used in the conditional simulation. An indicator variogram also was calculated and modeled for use in the indicator kriging. In general, there was far less structure and less of the variation explained in the indicator variogram (32 to 57%) than in the normal-scored regular variogram (35% and 64 to 97%) (Table 1).

To assess how areas of 'local' variability in the sample data changed across the state, the mean and standard deviation were calculated for each of the 23,400 cells of 3,000 x 3,000 m, using a 15 x 15 km area as the window defining the size of the 'local' area. All species exhibited a proportional effect, with areas of high mean corresponding with areas of high local standard deviation, indicating a lack of stationarity. Using normal-scored data seemed to largely eliminate this effect.

Table 1. The percent variation explained by the spatial dependence in the variograms.

Species         Indicator variogram    Normal-scores variogram
Beech           46                     76
Black oak
Chestnut oak
Hemlock
Red maple
Red oak
Sugar maple
White oak
White pine
Yellow birch
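For readers unfamiliar with the variogram calculation described in the Methods, the following is a minimal sketch of an omnidirectional empirical semivariogram with a fixed lag width. It is not the GSLIB routine used in the study; the function name, the binning rule (half-open bins of one lag width), and the three-point example are assumptions for illustration.

```python
import math


def empirical_variogram(coords, values, lag=500.0, n_lags=4):
    """Omnidirectional semivariogram binned by separation distance.

    Pairs separated by h fall in bin k when k*lag <= h < (k+1)*lag.
    Returns (bin centre, semivariance, pair count) for non-empty bins.
    """
    sums = [0.0] * n_lags
    counts = [0] * n_lags
    n = len(values)
    for i in range(n):
        for j in range(i + 1, n):
            h = math.dist(coords[i], coords[j])
            k = int(h // lag)
            if k < n_lags:
                sums[k] += (values[i] - values[j]) ** 2
                counts[k] += 1
    return [
        ((k + 0.5) * lag, sums[k] / (2 * counts[k]), counts[k])
        for k in range(n_lags)
        if counts[k] > 0
    ]


# Three made-up plots on a line, 500 m apart, with normal-scored values.
coords = [(0.0, 0.0), (500.0, 0.0), (1000.0, 0.0)]
values = [0.0, 1.0, 3.0]
gamma = empirical_variogram(coords, values)
```

Running the same function on raw and on normal-scored values is one way to reproduce the kind of comparison summarized in Figure 1. With several thousand plots the all-pairs loop is slow but still feasible.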

Indicator kriging and sgCS were run for each species, using models derived from the appropriate variograms. The estimation parameters of cell size (3,000 m), search radius (10,000 m), and minimum:maximum number of points used (1:16) were taken directly from the results of the previous study. In sgCS, 50 simulations were run for each species.
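The role of the search radius and the minimum and maximum number of points can be sketched as a simple neighbor selection. This is an illustration under assumptions: the function name and the plain nearest-first rule are hypothetical, and the search strategy in the GSLIB routines may differ in detail.

```python
import math


def select_neighbors(cell_xy, sample_xy, radius=10_000.0, min_pts=1, max_pts=16):
    """Return indices of the nearest samples within the search radius.

    At most max_pts indices are returned, nearest first. If fewer than
    min_pts samples fall inside the radius, the cell is left unestimated
    (an empty list is returned).
    """
    by_distance = sorted(
        (math.dist(cell_xy, p), i) for i, p in enumerate(sample_xy)
    )
    chosen = [i for d, i in by_distance if d <= radius][:max_pts]
    return chosen if len(chosen) >= min_pts else []


# Made-up plot locations in metres.
plots = [(0.0, 0.0), (3000.0, 4000.0), (20000.0, 0.0)]
near = select_neighbors((0.0, 0.0), plots)
far = select_neighbors((100000.0, 100000.0), plots)
```

Here the first cell uses the two plots within 10,000 m, and the second cell has no plots in range and so receives no estimate.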

Figure 1. Spatial dependence as demonstrated by the variogram of the raw data (left side) and variogram of the normal-scored data (right side) for white pine. Horizontal axis in both panels: distance (m), 0 to 100,000.

RESULTS

With the exception of red maple, all species examined demonstrated substantial spatial dependence in the variogram of normal-scored data, with 64 to 97% of the variation explained by the visible structure and capable of being modeled. In some species, much of that spatial dependence was contained in a very long-range trend of about 100,000 m. White oak, black oak, chestnut oak, and beech all fell into this category. The rest of the species appeared to split the bulk of the explained spatial dependence over 2 ranges. For red oak, this was both a short- and medium-range pattern (12,000 and 40,000 m); for sugar maple, a very short- and a long-range pattern (2,100 and 60,000 m); and for yellow birch, a medium- and long-range pattern (19,000 and 80,000 m) (Figure 2). The spatial dependence exhibited in the indicator variograms was much less, ranging from only 32 to 57% of the variation explained. This is not ideal for interpolation and suggests the necessity for further refinement.

Figure 2a-d. Variograms from the normal-scored data for four of the species. (Each panel plots the sample data and the fitted model against distance (m), 0 to 100,000; one panel is beech.)
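A variogram whose structure is split over two ranges is commonly represented as a nugget plus two nested structures. The sketch below assumes spherical structures and takes "percent variation explained" to mean the non-nugget share of the total sill. Both are assumptions of this example, as the text does not state the model type or the exact definition used. The two ranges are those reported for red oak; the nugget and partial sills are invented.

```python
def spherical(h, a):
    """Unit-sill spherical structure with range a."""
    if h >= a:
        return 1.0
    r = h / a
    return 1.5 * r - 0.5 * r ** 3


def nested_model(h, nugget, structures):
    """Nugget plus a sum of (partial sill, range) spherical structures."""
    if h == 0:
        return 0.0
    return nugget + sum(c * spherical(h, a) for c, a in structures)


def percent_explained(nugget, structures):
    """Share of the total sill not attributed to the nugget, in percent."""
    sill = nugget + sum(c for c, _ in structures)
    return 100.0 * (sill - nugget) / sill


# Ranges from the text (red oak); nugget and partial sills are made up.
red_oak = {"nugget": 0.3, "structures": [(0.3, 12_000.0), (0.4, 40_000.0)]}
at_sill = nested_model(40_000.0, **red_oak)
explained = percent_explained(**red_oak)
```

With these invented values the model reaches its sill of about 1.0 at 40,000 m, and about 70% of the variation is attributed to spatial structure.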

The results create several output datasets. Figure 3a-d shows the results for beech, using 4 datasets to represent the species. Part (a) shows the estimated probability of beech occurrence, as calculated from indicator kriging. Part (b) is the median value of the 50 sgCS simulations, representing the chosen estimate of beech relative 'importance' in %ba/acre. The uncertainty associated with that estimate (here chosen to be a percentile range capturing approximately 2/3 of the distribution) is described in (c) and (d). Part (c) expresses the minus variation, or the distance in %ba/acre values between the median and the bottom of that range (the 17th percentile), and part (d) expresses the plus (+) variation (83rd-50th percentiles). In every instance, the + variation is much greater.
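The per-cell summary described above (median, minus variation, plus variation) can be sketched as follows. The linear-interpolation percentile rule and the made-up realizations are assumptions of this example; the text does not say how percentiles were computed.

```python
def percentile(sorted_vals, p):
    """Linear-interpolation percentile of pre-sorted values, p in [0, 100]."""
    pos = (len(sorted_vals) - 1) * p / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * (pos - lo)


def summarize_cell(realizations):
    """Median estimate with minus (50th-17th) and plus (83rd-50th) variation."""
    vals = sorted(realizations)
    median = percentile(vals, 50)
    return {
        "median": median,
        "minus": median - percentile(vals, 17),
        "plus": percentile(vals, 83) - median,
    }


# Made-up simulated %ba/acre values for one cell (skewed, many zeros).
cell = [0, 0, 0, 0, 0, 0, 1, 2, 5, 20]
summary = summarize_cell(cell)
```

For a skewed set of realizations like this one, the plus variation exceeds the minus variation, which is the pattern reported for every species.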

Figure 3. Four estimated datasets describing the distribution of beech in Pennsylvania: a) the estimated probability of occurrence using indicator kriging, b) the median values from 50 sequential gaussian conditional simulations, c) the minus variation (median-17th percentile), and d) the plus variation (83rd percentile-median) about the median estimate.

DISCUSSION

The probability that a species occurs is a unique and useful dataset. Any probability can be used as the cutoff to create a species presence/absence map, depending on the objectives of the task at hand. If, for example, a particular insect is known to live in chestnut oak forests and the objective is to limit the search to only those areas where there is a high probability of finding suitable conditions, we might set the cutoff at a probability level of ≥ 0.8. If, however, we are most interested in not missing any areas where the insect occurred, we might set the probability cutoff much lower, say ≥ 0.4.
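The effect of the cutoff choice can be shown on a tiny grid. The probabilities below are invented, and the function name is hypothetical; only the two cutoff levels (0.8 and 0.4) come from the text.

```python
def presence_map(prob_grid, cutoff):
    """Mark a cell as 'present' when its probability meets the cutoff."""
    return [[p >= cutoff for p in row] for row in prob_grid]


# Made-up indicator-kriging probabilities for a 2 x 2 block of cells.
probs = [
    [0.90, 0.50],
    [0.30, 0.85],
]
strict = presence_map(probs, 0.8)   # fewer errors of commission
lenient = presence_map(probs, 0.4)  # fewer errors of omission
strict_cells = sum(cell for row in strict for cell in row)
lenient_cells = sum(cell for row in lenient for cell in row)
```

The strict cutoff keeps two cells and the lenient cutoff keeps three, so lowering the cutoff trades errors of omission for errors of commission.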

[Figure 5 legend: acceptable, moderate, unacceptable]

Figure 4. Chestnut oak occurrence as a major (>40 %ba/acre), moderate (20-40%), and minor component (1-20%). Derived from the sgCS estimates.
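Because the estimated dataset itself is unclassed, classes such as those in Figure 4 are applied by the user afterwards. A minimal sketch, assuming the class limits of 1, 20, and 40 %ba/acre; how values that fall exactly on a limit are assigned, and the label for values below 1, are choices made for this example only.

```python
def importance_class(pct_ba):
    """Assign a %ba/acre estimate to a user-defined importance class."""
    if pct_ba > 40:
        return "major"
    if pct_ba >= 20:
        return "moderate"
    if pct_ba >= 1:
        return "minor"
    return "not mapped"  # below the lowest class limit


# Made-up median %ba/acre estimates for a row of cells.
estimates = [0.2, 4.0, 20.0, 35.5, 62.0]
classes = [importance_class(v) for v in estimates]
```

Changing the limits or adding classes only requires editing this function; the underlying estimates stay the same.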

The distribution of densities at which a species occurs also can be of considerable interest. For example, Figure 4 is a plot produced from sgCS estimates illustrating where chestnut oak occurs as a minor, moderate, and major component. The three categories have been defined as 1 to 20 %ba/acre, 20 to 40 %ba/acre, and >40 %ba/acre. The %ba/acre estimate in Figure 4 can be associated with a corresponding uncertainty dataset (Figure 5) in which the uncertainty classes are broken into acceptable, moderate, and unacceptable. Again, the dataset itself is unclassed, and the user defines these classes--all the building blocks are provided in the output from sgCS.

Alternatively, the same %ba/acre dataset(s) can be used to create a forest cover type 'map.' For this purpose, the summary statistic (whether mean, median, or another of the percentiles of each cell's simulated distribution) can be chosen specifically for the intended purpose, in the same way the different levels of probability could be chosen when mapping species presence/absence from indicator kriging estimates. If the objective is to reduce the error of commission (i.e., classifying areas as sugar maple/beech (SM/B) in the estimated map that really are not SM/B), then using a percentile at the lower end of each cell's distribution would be more desirable. If, however, the objective is to reduce the error of omission (i.e., missing areas that do contain SM/B), then using a higher percentile from the distribution, such as the 75th, would be more desirable.

The %ba/acre estimates of three species were further refined. Two species, white pine and yellow birch, nearly disappeared in the initial sgCS estimate even though they showed up as present in the probability of occurrence estimate from the indicator kriging. This was considered suspicious, so the dataset for each of these two species was divided into several different populations by region, and the process of variogram modeling and interpolation was repeated for each region. The regionalized variogram was substantially different from the average variogram, and when sgCS was performed using the locally tuned models, it revealed an entirely different and much more credible picture of white pine distribution in Pennsylvania. The third species, hemlock, was refined based primarily on the suspicion that small and large stand-size classes may have different spatial distribution patterns. When variograms were calculated separately for these two populations (using a cutoff of 45 years), they were indeed substantially different in shape, sill, nugget, and range.

CONCLUSIONS

The estimated datasets output by indicator kriging and sgCS have the potential to be very useful. Every species examined, with the exception of red maple, exhibited substantial spatial dependence in the variograms of the normal-scored data, suggesting that there is considerable potential for the estimation of such datasets from FIA data. Each species exhibited some variety in spatial patterns and spatial dependence and may require different levels of additional fine-tuning, depending upon the objectives of the specific analysis and the time and expertise available.

Characteristics that make the datasets useful

These techniques make explicit the uncertainties associated with an estimate in a form that can be incorporated when the data are used. This feature adds considerable utility and flexibility in the use of the resulting estimates, as the risk of errors of commission or omission can be specifically determined and manipulated to suit the current objectives. Maintaining individual species information separately allows considerable flexibility in the use of species distribution data. Instead of being limited to previously defined fixed classes, forest cover types can be uniquely defined to capture more accurately the habitat required for a particular study. The potential also exists to use one or more of the species datasets as a decision layer in the interpretation of satellite imagery. The two data sources offer complementary information about the species composition that really exists on the ground. The techniques used in this study are neither extremely time-consuming nor difficult to apply, and could be easily extended to additional species and states.

There is a high level of variance associated with these estimates of %ba/acre--in many locations this variance can be as much as the estimate itself. However, the dataset nevertheless provides a very descriptive picture of species distribution at the state level. In comparison to previous depictions of current species distribution from FIA data by summarizing at the county level, this method provides a much more detailed picture of species occurrence and distribution. Although estimates could probably be improved and variances diminished by additional investigation into each species, the current estimates are informative and provide a useful basis from which to proceed.
Clues to refining the interpolation and improving the estimates

As was observed with white pine, yellow birch, and hemlock, it may be possible to significantly improve the estimates of %ba/acre by refining the analysis. When subpopulations of a species have a significantly different pattern of spatial distribution, treating the populations separately in the interpolation will improve the final estimates. These populations may be described by regional land features or by some other defining characteristic (e.g., stand age for hemlock). The results suggest several clues to determine when this additional effort is necessary. The first is when the results of indicator kriging show a species as present, but those of sgCS do not. The second is when the spatial dependence in the variogram is noticeably less than expected. This may suggest that there are different populations being lumped together that should be separated. For white pine, dividing the state into several broad geographic regions made a significant difference in the calculated variogram and thus in the final estimates. Another important clue is previous knowledge about the species: that different ecological regions may have caused distinct spatial distribution patterns, or that different size-class or age populations may have different spatial distribution patterns over the landscape. Hemlock is an example of the latter. As a result of past management practices that involved heavy harvesting of large hemlock for the tanning industry, today there are often relics of large individuals among a relatively wider distribution of smaller, younger trees that have grown up in the interim (Hough and Forbes 1943, Powell and Considine 1982).

These datasets of individual species distribution do not contain any of the fine-scale forest/nonforest detail. If such information is desired, more detailed datasets describing the forest/nonforest land cover in Pennsylvania would have to be derived from a more intense point sample or the continuous but averaged data available from satellite imagery. Such detailed datasets could be used as a 'mask' and overlaid on any of the datasets of species distribution.

There are more possibilities for applying geostatistical techniques than have been investigated here. For example, some species may exhibit some correlation with a particular soil, climate, topography, or reflectance data from satellite imagery. Indicator kriging and sgCS, in particular, allow the incorporation of such ancillary, 'soft' information to contribute to the estimation process.

REFERENCES

Alerich, C.L. 1993. Forest Statistics for Pennsylvania--1978 and 1989. Resource Bulletin NE-126. USDA Forest Service, Northeastern Forest Experiment Station, Radnor, PA. 244p.

Deutsch, C.V. and A.G. Journel. 1992. GSLIB: Geostatistical Software Library and User's Guide. Oxford University Press, New York.

Hansen, M.H., T. Frieswyk, J.F. Glover, and J.F. Kelly. 1992. The Eastwide Forest Inventory Data Base: Users Manual. General Technical Report NC-151. USDA Forest Service, North Central Experiment Station, St. Paul, MN.

Hough, A.F. and R.D. Forbes. 1943. The ecology and silvics of forests in the High Plateaus of Pennsylvania. Ecological Monographs. 13:299-320.

Isaaks, E.H. and R.M. Srivastava. 1989. An Introduction to Applied Geostatistics. Oxford University Press, New York.

Powell, D.S. and T.J. Considine. 1982. An analysis of Pennsylvania's forest resources. Resource Bulletin NE-69. USDA Forest Service, Northeastern Forest Experiment Station, Broomall, PA. 97p.

Rossi, R.E., P.W. Borth, and J.J. Tollefson. 1993. Stochastic simulation for characterizing ecological spatial patterns and appraising risk. Ecological Applications. 3(4):719-735.

BIOGRAPHICAL SKETCH

Rachel Riemann Hershey is a forester/geographer with the Forest Inventory and Analysis Unit, Northeastern Forest Experiment Station. She received a B.A. in ecology from Middlebury College, an M.S. in forestry from the University of NH, and an M.Phil. in geography from the London School of Economics.