Water Qual. Res. J. Canada, 2005
•
Volume 40, No. 3, 347–360
Copyright © 2005, CAWQ
Considerations when Using the Reference Condition Approach for Bioassessment of Freshwater Ecosystems Michelle F. Bowman1,2* and Keith M. Somers2,1 1
Department of Zoology, University of Toronto, 25 Harbord Street, Toronto, Ontario M5S 3G5 2 Dorset Environmental Science Centre, Ontario Ministry of the Environment, 1026 Bellwood Acres Road, Dorset, Ontario P0A 1E0
The use of the reference condition approach (RCA) in environmental assessments is becoming more prevalent. Although the RCA was not explicitly described in Green’s (1979) book on statistical methods for environmental biologists, we expanded his decision key for selecting an appropriate environmental study design to include this approach. The RCA compares the biological community at a potentially impacted ‘test’ site to communities found in minimally impacted ‘reference’ sites. However, to implement the RCA there are a number of assumptions and decisions that must be made. We compare several common multimetric and multivariate bioassessment methods to illustrate that four key decisions inherent in the RCA framework (i.e., criteria used for reference site selection, for grouping similar reference sites, for comparing test and reference sites, and for evaluating the cause of impacts) can markedly affect test site appraisals. Specific guidelines should be developed to select appropriate reference sites. Based on analyses of real and simulated data, we recommend a minimum of 20, but preferably 30 to 50 reference sites per group, and verification of groupings with more than one classification method. New approaches (e.g., test site analysis) incorporating the strengths of both multimetric and multivariate methods can be used to compare test and reference sites. Additional ecological information, models relating degree of impact to a stressor or habitat gradient, and variance partitioning can also be used to isolate the probable cause of impairment, and are particularly valuable when appropriate reference sites are unavailable. Key words: reference condition approach (RCA), bioassessment, site classification, test site analysis (TSA), stressor identification
Experimental Designs for Environmental Studies
In many instances, impacts have occurred before environmental studies are initiated. If the impact has occurred and ‘when and where’ are known, the impact can be inferred from spatial patterns (Fig. 1A). For example, if a control area is available, a control-impact (C-I) comparison can be used (e.g., upstream-downstream contrast), and if a control area is not available, the biological response to varying levels of impact can be used (e.g., downstream gradient design) to evaluate the impact (Fig. 1B). However, evaluating impacts with spatial patterns is suboptimal because differences among areas may have existed before the supposed impact occurred (Chapman 1999). When the only information available is that the impact has occurred, the purpose of study becomes to determine ‘when and where.’ As Green notes, the ‘when and where’ design is used in the worst possible situations, and the goal of this type of study is to establish that the impact is due to anthropogenic rather than natural causes. If representative reference ecosystems rather than control areas within the system of interest are used as benchmarks, the ‘when and where’ branch of Green’s decision key can be expanded to include the reference condition approach (RCA, Fig. 1B). It is important that the reference ecosystems are appropriate controls for the potentially impacted area. If there are no appropriate reference ecosystems available (e.g., large lakes or rivers),
Green (1979) classified experimental designs for environmental studies into five general categories depending on whether the timing and location of an impact are known, and whether a reference area is available for comparison (see Fig. 1A). Whenever possible, data should be collected before an impact occurs. The most rigorous before-after, control-impact (BACI) study design requires that the impact has not occurred before the study begins, that ‘when and where’ the impact will occur are known, and that there is a comparable control or unexposed area in the ecosystem of interest (but also see Underwood 1997). When a spatial control area is lacking, the effects of an impact must be inferred from temporal changes alone (i.e., a before-after [or B-A] comparison). The B-A design is suboptimal because temporal changes may have also occurred in an unexposed area, had there been one available for comparison (e.g., Chapman 1999). A monitoring study can be conducted when we know the impact has not yet occurred, but we do not know when or even if it will occur (Green 1979). In the case of non-point source pollution (e.g., long-range atmospheric transport, groundwater contamination, agricultural land use), we may also not know where an impact will occur. * Corresponding author;
[email protected] 347
348
Bowman and Somers
Fig. 1. Decision key (A) to categorize the optimal before-after, control-impact (BACI) and suboptimal (temporal, monitoring, spatial, ‘when and where’) environmental study designs described by Green (1979), and (B) expanded to include types of spatial, ‘when and where’ and monitoring studies (underlined study designs were not specifically described by Green [1979]).
the modern analog approach can be used (i.e., historical information stored in sediment layers as in paleolimnology, tree rings, ice cores [Walker 1993, Paterson et al. 2002]). Although the phrase “reference condition approach” usually refers to studies where comparable ecosystems are used as benchmarks for comparison, all environmental study designs use spatial and/or temporal approximations of the reference condition. Thus, the focus of environmental study designs is to determine if the supposed environmental impact in question existed at times before the area was exposed to human activities, or at locations that were never exposed. The four main steps in the RCA include selecting minimally impacted reference sites, grouping reference sites into biologically similar types, comparing a potentially impacted test site to reference condition, and diagnosing the cause of an impact (e.g., Bailey et al. 2004, Fig. 2). When using the RCA, biological, chemical and physical attributes of minimally impacted reference ecosystems are characterized. Chemical and physical features (e.g., stream order) not likely to be affected by anthropogenic activities that can be used to separate biologically different reference-community types are identified (Hawkins et al. 2000). These discriminatory chemical and physical para-
meters are used to assign a potentially impacted test site to the most similar reference group(s). The biological attributes of the test site of interest are compared to reference site communities to determine if the test site is also a good match biologically. Assuming appropriate reference sites are used, differences in the biological attributes of a test site relative to those of reference sites imply impairment. The final step in the RCA is to determine why the biological community at an impacted site is impaired.
History of the Reference Condition Approach The RCA has evolved from a number of different bioassessment programs with similar objectives. Although numerous examples predate Green’s (1979) book on experimental designs, the unifying challenge identified by Green was to conduct a bioassessment when the impact had already occurred and ‘when and where’ were not known. The original study objectives that led to the development of the national biomonitoring program in the U.K. were to develop a biological classification system for unpolluted waters based on macroinvertebrate fauna, and to assess whether the fauna could be predicted from environmental features
Using the RCA in Bioassessments
349
Fig. 2. Common steps used in the bioassessment of a potentially impacted test site using the reference condition approach (unweighted pair-groups method using arithmetic averages [UPGMA], index of biological integrity [IBI], test site analysis [TSA] and Australian River Assessment System [AusRivAS]).
(e.g., Moss et al. 1987, Wright 2000). The program in the U.K. was the basis for similar multivariate bioassessment programs in Australia (Davies 1994) and Canada (Reynoldson et al. 1995) wherein minimally impacted reference sites are used to set benchmarks for the assessment of potentially impacted test sites. Although the use of reference benchmarks is also the basis for bioassessment programs in the U.S. (Hughes et al. 1986), the methods evolved independently and are thus somewhat different than in the U.K.-type model. In most U.S. bio-
monitoring programs, a number of summary biological indices are used in a multimetric (rather than multivariate, Fig. 2) approach (Barbour et al. 1995, 1999). Below we briefly describe these common multimetric and multivariate approaches to illustrate the key decisions required when using the RCA (Table 1). Although minimally impacted reference sites are used to establish a reference benchmark in all approaches, the methods used to group reference sites, and to compare reference and test sites vary among approaches.
Ecological significance
Analysis
Number of reference sites and groups
Observed/expected taxa occurrences and pollution tolerances 3 to 6 bands of “condition” 1st to 5th (tolerance) or 10th (occurrence) of O/E score distributions
a posteriori classification using TWINSPAN clustering of taxa presence-absence (later also used log abundance categories) and a DFA to select environ. variables 614 sites, 35 groups mean 18 sites/group (range 6–39)
genus/species
Taxonomic resolution Classification method
Subsampling
~30 total, map derived (e.g., altitude), historical (e.g., discharge), site (e.g., substrate, water chemistry) 2h
Environmental variables measured
Net size (µm) Collection effort
Metric selection, calibration and aggregation into an index, e.g., 5 bands of condition e.g., >75th percentile of aggregate score at reference sites
a priori classification based on physiochemical or biogeographical attributes, confirmed with e.g., species composition
500 (D net) Single habitat – 2 m2 Multi-habitat – 20 jabs or kicks Stream type, watershed features, riparian vegetation, instream features, woody debris, aquatic vegetation, substrate Minimum of 100 organisms (pan with grid) Family or genus/species
Riffles and runs or proportional Varies
In proportion to occurrence Spring, summer and fall (combined) 900 (D net) 3 min, kick and sweep
Habitat(s) Season(s)
Multimetric (e.g., Barbour et al. 1999) e.g., No alterations in organic matter availability, water quality (e.g., temp., O2), habitat structure, or flow regime
RIVPACS (e.g., Wright et al. 2000)
Reference site selection criteria
Consideration
TABLE 1. Considerations when using the reference condition approach
200 organisms (Marchant box) Genus/species
5 landscape, 10 site/reach, 16 channel/substrate, 9 water column
400 (D net) 3 min
Riffles Fall
Minimal logging, mining, agriculture, flow modification or urbanization
BEAST-Fraser River (e.g., Bailey et al. 2004)
88 to 94 sites, 4 to 5 groups per season mean 18 to 24 sites/ group (range 5–37) Observed/expected taxa occurrences and pollution tolerances 5 bands of “condition” >90th or 50 m upstream and >300 m downstream of a weir, dam, ford, waterfall >5 m, significant discharge or confluence, livestock watering, flow diversion, channelization, dredging or weed removal Edge and pool treated separately Spring and fall (separately and pooled) 250 (D net) 10-m transect
AusRivAS-A.C.T. (e.g., Bailey et al. 2004)
350 Bowman and Somers
Using the RCA in Bioassessments
RIVPACS/AusRivAS The national River InVertebrate Prediction And Classification System (RIVPACS) developed in the United Kingdom provides site-specific predictions of the macroinvertebrate fauna expected to occur in the absence of anthropogenic stress (e.g., Wright et al. 2000). A similar bioassessment program, the Australian River Assessment System (AusRivAS), has also been widely implemented (e.g., Davies 2000). Although the type of clustering method used to classify reference sites into groups differs in RIVPACS and AusRivAS (Table 1), in both approaches the probability that a given taxon occurs at a site is estimated using:
351
benthic community. In the BEAST approach, test sites outside the 99% confidence ellipse for a given reference site group (in ordination space) are considered different from reference, and the difference between the test site and reference condition is determined by evaluating which taxa are highly correlated with the axes scores.
Regression-Based Approaches
The assessment of a test site is based on the overall ratio of observed to expected (O/E) taxa (i.e., O/E scores). In addition, O/E scores weighted with pollution tolerances of taxa are often used. To determine the potential cause of an impact, the observed taxa (or taxa tolerance) list is compared to expectations.
In addition to the standard approaches, several regressionbased, RCA-type methods have been proposed. Rather than grouping reference sites, Bailey et al. (1998) used environmental covariates to model variation among all of their reference sites before projecting test sites onto the regression. Verdonschot (1995) used a combination of grouping and regression with a type of canonical correspondence analysis (CCA) to characterize benthic assemblages in streams in The Netherlands. The CCA approach is commonly used in paleolimnology where historical data representing predisturbance conditions are used to evaluate and predict recovery of assemblages that have been impaired (Dixit et al. 1992). Logistic regression models have also been used to model impaired benthic communities along environmental gradients (e.g., Kennen 1999).
Multimetric
Unanswered Questions in the RCA
Unlike the other methods described here, a priori rather than a posteriori reference site classifications are usually used in the multimetric approach (Table 1, Fig. 2). Thresholds separating scores for unimpaired and impaired communities are developed for each index and each stream type (e.g., Barbour et al. 1995). The richness (e.g., number of taxa), composition (e.g., % chironomids), tolerance (e.g., Hilsenhoff Biotic Index) and trophic/habitat (e.g., % filterers) metrics that best discriminate between reference and degraded sites are combined into an aggregate score and test sites are evaluated based on their aggregate score relative to scores expected from reference sites. The responses of various metrics to a variety of disturbance gradients have been well documented (e.g., Barbour et al. 1999), and are therefore valuable in diagnosing the cause of an impact.
Although the RCA has been used to assess anthropogenic impacts in aquatic ecosystems for some time, an evaluation using the RCA still requires several subjective decisions. That is, best professional judgment is required to select minimally impacted sites, group reference sites, determine the criteria used to distinguish minimally impaired from significantly impaired test sites, and to infer why the community in an impacted site is impaired (Fig. 2). Through literature review and analyses of real and simulated data, we illustrate below that these four decisions in the RCA framework can markedly affect the outcome of an evaluation. Because these decisions will vary with the objectives of a given evaluation or program, our goal is to highlight these issues and provide guidance so more effective and objective decisions can be made in the future.
BEAST
What is Minimally Impacted?
The BEnthic Assessment of SedimenT (BEAST) approach was originally developed to assess nearshore areas of the Laurentian Great Lakes (Reynoldson et al. 1995) and was later used to assess stream condition in the Fraser River basin (e.g., Resh et al. 2000). The method used to group reference sites in BEAST is the same as in AusRivAS; apart from taxa abundance data being used in place of taxa presence-absence data (Table 1). However, the relative proximity of test and reference sites is evaluated in a subspace defined by a multivariate ordination of the
Although selecting minimally impacted sites is the basis for the RCA, there is no accepted method for identifying minimally impacted reference site conditions. For example, a general guide to reference site selection includes defining study boundaries, establishing criteria for stratification (e.g., ecoregion), collecting local knowledge, and iteratively confirming or rejecting each candidate site (Reynoldson and Wright 2000). More specifically, Hughes et al. (1986) suggested steps in reference site selection should include: (1) elimination of watersheds with
n
Σ
(Probability test site is in reference group x) ⋅ (1) (Frequency of taxon in reference group x)
x=1
352
Bowman and Somers
significant human disturbances, (2) selection of sites with comparable watershed area and annual stream discharge as potentially impacted sites, (3) selection of stream types most typical of the region (e.g., stream gradient), (4) consideration of protected areas or refuges, (5) consideration of known zoographic patterns, and (6) ranking of candidate sites by level of disturbance. Similarly, Davies (1994) suggested potential sampling locations can be separated into least, moderately and most disturbed conditions using local knowledge and evidence of 11 main categories of human impact (e.g., forestry), and best available reference sites (with like stream order, altitude, climatic region, geology as most disturbed sites) can be selected from least and (if necessary) moderately disturbed categories (see Bailey et al. 2004). The consensus among international scientists was that identifying high-quality sites is not always straightforward, and a more formal process should be adopted for the selection of reference sites (Reynoldson and Wright 2000). The commonalities among various reference site selection methods include establishing spatial and temporal study boundaries, ensuring reference stream types are consistent with potentially impacted stream types, ranking quality of potential reference sites, and cross-checking with local knowledge and site visits (Fig. 2). These commonalities provide a foundation for the development of reference-site selection guidelines. In situations where no suitable reference sites exist (e.g., regional or global stressors, large lakes or rivers impacted by multiple stressors), proxy data, additional ecological information, modelling, or variance partitioning can be used to help assess whether a test site has been impacted. Paleolimnological techniques may be the only option when assessing the overall impact of regional or global stressors such as eutrophication, organic or metal contamination, acidification, or climate change (Dixit et al. 1992, Walker 1993). Additional proxy (i.e., the modern analog approach, Paterson et al. [2002]), biological (e.g., pollution tolerances of taxa) or habitat (e.g., apparent water quality or physical degradation) information will aid in impact assessment. Models relating supplementary information to biological attributes can be used to help: (1) predict the effects of rehabilitation or additional stress (Fig. 3A), (2) compare sites with different habitat characteristics (Fig. 3B), or (3) assess the condition of a test site without a reference benchmark (e.g., Fig. 3B; see Bailey et al. [1998, 2004] for further explanation). When trying to separate impacts associated with more than one stressor, it is important to measure or predict biological attributes of the reference condition because impacts may be cumulative (Fig. 3C), distinct (Fig. 3D) or synergistic (Fig. 3E). When lack of empirical data precludes modelling biological responses to stressor(s) of interest, redundancy analysis (e.g., Legendre and Anderson 1999) can be used to partition the biological variability associated with various stressors (Fig. 3E). Consideration of supplementary ecological information,
development of biological response models (i.e., as in Bailey et al. 1998), and partitioning biological variation can also be used to determine why a test site is impaired, when reference sites are unavailable (Fig. 2).
How Many Reference Site Groups? Although study objectives and data availability will influence the number of reference site groups, classification methods that generate too many or too few reference groups can result in inaccurate and imprecise biological assessments of test sites. Grouping dissimilar ecosystem or community types in a RCA assessment may reduce the ability to detect impacts. However, using too many reference site groups results in narrow relevance of study results, and can lead to poor descriptions of any one ecosystem type (see the section How Many Sites Within Each Reference Group? below). Reference site groups can be identified either a priori using habitat characteristics of reference sites, or a posteriori using biological characteristics (Fig. 2). Analysts who use a multimetric bioassessment approach have tended to use a priori classification of reference sites (e.g., VanSickle and Hughes 2000), whereas analysts who use a multivariate approach have tended to use a posteriori classification (e.g., Bailey et al. 2004), but a priori classification can be used with multivariate assessments and vice versa (Barbour et al. 1999). In either approach, it is important that the environmental variables used to distinguish community types will not be altered by human activities (e.g., catchment area, slope, geology). The premise underlying the a priori, ecoregion approach to grouping reference sites is that relatively homogeneous biological communities are expected to occur in areas with distinct sets of characteristics, and relatively homogeneous areas can be delineated by simultaneously analyzing causal and integrative factors including land surface form, soils, land use and natural vegetation (Omernik 1987). Ecoregions may be subdivided further (e.g., by elevation) when there is considerable heterogeneity within a region (e.g., Barbour et al. 1999). For an ecoregion approach to be used effectively, detailed, largescale maps of habitat characteristics (e.g., soils, vegetation) must be available, and hypotheses of the location of ecoregion boundaries should be tested and refined (Omernik 1987; VanSickle and Hughes 2000). In the alternative a posteriori approach, the principle underlying the a priori approach is reversed such that reference sites with similar biological communities (rather than similar habitat characteristics) are grouped, and habitat characteristics that separate community types are identified. Ordination and clustering methods are used to define groups in the a posteriori approach, and subsequent discriminant functions analysis (DFA) is used to identify environmental variables best able to separate biological community types. Clustering can be
Using the RCA in Bioassessments
353
Fig. 3. An illustration of (A) the use of models to estimate reference condition or additional stress, (B) comparing biological condition at sites with different habitat characteristics (e.g., stream order), or assessing the degree of impact at test sites in the absence of suitable reference sites, using models, (C) a cumulative biological response to stressors, (D) a distinct biological response to stressors, and (E) partitioning the biological variance associated with various stressors.
done by: (1) joining sites and groups of sites until all sites are in one group (i.e., agglomerative hierarchical clustering, e.g., unweighted pair-groups method using arithmetic averages [UPGMA] as in AusRivAS or BEAST, or Ward’s method), (2) splitting one group into smaller and smaller groups until each group is a single site (i.e., divisive hierarchical clustering, e.g., two-way indicator species analysis [TWINSPAN] as in RIVPACS), or (3) starting with one site and clustering other similar sites, allowing sites to be reassigned to groups during clustering (i.e., nonhierarchical clustering; e.g., K-means clustering; Quinn and Keough [2002]). Legendre and Legendre (1998) present thorough descriptions of numerous clustering methods (including the methods listed here), provide examples of their use in ecology, and summarize the pros and cons for each method. To illustrate that different grouping methods typically produce somewhat different results, we analyzed
benthic macroinvertebrate (BMI) abundance data collected annually for nine years from four headwater streams in forested catchments near Dorset, Ontario (see Bowman et al. [2003] for detailed description), using two ordination methods (correspondence analysis [CA], non-metric multidimensional scaling [NMDS]), and four clustering methods (TWINSPAN, UPGMA, Ward’s, Kmeans). For each ordination, the first two axes are plotted, and the 90% confidence ellipses for the nine observations (i.e., years) for each stream are used to show the variability within each group (Fig. 4A,B). The clustering results are also summarized with 90% confidence ellipses enclosing the sites assigned to each cluster group in the original CA plot (Fig. 4C,D,E,F). The ordinations showed BMI communities in streams C and D were similar, but in general, BMI communities within a given stream over time were more similar to one another than BMI communities among streams
354
Bowman and Somers
Fig. 4. The 90% confidence ellipses for biological communities (from Stream A, B, C and D) grouped by (A) stream in a correspondence analysis (CA) ordination, (B) stream in a non-metric multidimensional scaling (NMDS) ordination, and groups in a CA based on clustering with (C) two-way indicator species analysis (TWINSPAN), (D) unweighted pair-groups method using arithmetic averages (UPGMA), (E) Ward’s and (F) K-means methods with a 4-group cutoff.
(Fig. 4A,B). When TWINSPAN was used to cluster reference communities, Group 1 contained primarily Stream A communities (with the exception of D8), Group 2 contained primarily Stream B communities (with the exception of C4 and D4), Group 3 contained just three communities (B1, B4, C1), and Group 4 contained the remaining Stream C and D communities (Fig. 4C). The results of the UPGMA, Ward’s and K-means clustering were similar: Stream A, B, C and D communities were generally assigned to Group 1, 2, 3 and 4, respectively (with 0–3 exceptions in each instance; Fig. 4D,E,F). Overall, TWINSPAN, UPGMA, Ward’s and K-means clustering resulted in groupings by stream for 67, 92, 89 and 81% of communities, respectively. Consistent with our results, Moss (2000) compared the same four clustering methods using RIVPACS data and found the percentage of sites correctly allocated to groups with DFA were similar (57–66%) among methods. Our results also showed that the shape and size of the 90% confidence ellipses changed considerably when at least one BMI community was placed in a group different than in the original CA. For example, when site B4 was assigned to Group 3 (rather than Group 2) in both the UPGMA and
Ward’s clustering, the size of the 90% confidence ellipse for Group 3 approximately tripled (Fig. 4D,E). These types of inaccuracies arising from different clustering approaches may affect BEAST-type analyses. However, misclassification in reference site groupings will alter the description of community types and consequently affect the accuracy of all RCA-type bioassessment methods. We randomized and re-analyzed the stream benthic macroinvertebrate data set described above to demonstrate clustering methods will assign sites to groups regardless of whether true groups exist. Each column in the taxa abundance (column) by site/year (row) data set was randomly shuffled. Ordinations of randomized data showed there were no clear site groups (Fig. 5A,B). However, UPGMA clustering of randomized data suggested there were two groups and four outliers (Fig. 5C), and Ward’s clustering suggested there were four groups (Fig. 5E). We used 90% confidence ellipses around sites within groups in CA space to show the degree of overlap among groups (Fig. 5D,F). There was a high degree of overlap among the four groups established in Ward’s clustering (Fig. 5F), suggesting there were no true groups. However, without a priori knowledge, the UPGMA clusters of ran-
Using the RCA in Bioassessments
355
Fig. 5. The 90% confidence ellipses for randomized biological community data (shown in Fig. 4) grouped by (A) stream in a correspondence analyses (CA) ordination, and (B) by stream in a non-metric multidimensional scaling (NMDS) ordination, (C) unweighted pair-groups method using arithmetic averages (UPGMA), (D) UPGMA clustering in CA space, (E) Ward’s method, and (F) Ward’s clustering in CA space, with a 4-group cutoff.
domized data would suggest legitimate groupings (Fig. 5D). Thus, groupings based on clustering methods should be confirmed with other clustering or ordination methods, a priori predictions, or alternative methods (as recommended by Green [1979]). Similarly, Davies (1994) recommended that UPGMA clustering be used to group reference sites in AusRivAS, but that ordination and TWINSPAN also be used as supplementary aids in identifying groups. However, the issue of considerable site classification error and what to do if classifications do not agree remains.
How Many Sites Within Each Reference Group? Irrespective of the subsequent method of analyses used to compare reference and test sites, too few reference sites per group can result in an inaccurate description of biological community types, and thus, inaccurate test site evaluations. We used simulated data to estimate how many sites are required to adequately describe a group. First, we simulated data that emulated a standardized (mean = 0, standard deviation = 1) biological community
descriptor (e.g., % chironomids, ordination axis scores) at 7500 sites. We assessed changes in the accuracy and precision of the estimated mean and standard deviation with increasing numbers of sites (n = 5, 10,…, 250) per group. For example, the group mean and standard deviation were calculated using five randomly selected data points and the calculations were repeated 30 times (each data point was used only once). We then simulated four additional data sets that each contained two sets of 7500 standardized observations (with mean = 0, standard deviation = 1) such that the Pearson correlation between the two variables in the data sets was 0.0, 0.3, 0.7 and 0.9, respectively. Knowing the actual correlations between variables, we were able to assess changes in the accuracy and precision of estimates of correlation with increasing numbers of sites per group. In assessments using a single biological metric (e.g., RIVPACS/AusRivAS), it is important to accurately estimate group mean and variance. It is also important to accurately estimate the correlation or covariance among metrics (see the section How to Compare Reference and Tests Sites? below) when two or more biological metrics are used (e.g., multimetric, BEAST).
356
Bowman and Somers
Our examination of changes in the accuracy and precision of estimates of mean, standard deviation and correlation with increasing sample size suggests the common recommendation of at least five sites per group (e.g., Reynoldson and Wright 2000) may be insufficient for adequate descriptions of reference sites groups (e.g., Barrett and Goldsmith 1976). Analyses of simulated data showed estimates of mean and standard deviation based on five sites were inaccurate and imprecise (Fig. 6). The precision in estimates of mean and variance increased with increasing sample size but reasonable precision (±0.5) was obtained with sample sizes ≥20. At a given sample size, greater accuracy and precision in estimates of correlation were associated with larger correlations between variables (Fig. 7). Reasonable levels of accuracy and precision were obtained with 10 to 20 sites per group when the correlation between variables was 0.7 to 0.9, but 40 to 50 sites per group were needed for similar accuracy and precision where r = 0.0 to 0.3. Since imprecision in each step of a RCA study presumably results in imprecision in the bioassessment of test sites, it is advisable to err on the side of caution. Thus, we suggest using no less than 20 but preferably 30 to 50 sites per group in RCA bioassessments.
What is an Ecologically Significant Effect? To evaluate whether the community at a test site is characteristic of reference site communities, the effect size (i.e., difference among communities) considered meaningful should be defined a priori. In a traditional hypothesis test, we would evaluate whether the difference between the test community and reference community mean is sigFig. 7. Box (25th percentile, median, 75th percentile) and whisker (minimum and maximum) plots of the correlation between two simulated biological attributes with increasing numbers of sites per group when the correlation between variables is (A) 0.0, (B) 0.3, (C) 0.7 and (D) 0.9.
Fig. 6. Box (25th percentile, median, 75th percentile) and whisker (minimum and maximum) plots of the group (A) mean and (B) standard deviation of a simulated biological attribute estimated with increasing numbers of sites per group.
nificantly different from zero. However, in the context of a RCA, statistically different from zero is likely insignificant ecologically (McBride et al. 1993). We are instead interested in whether the biological condition of a test site falls outside the range of conditions characteristic of reference sites. Thus, Kilgour et al. (1998) proposed that an ecologically significant difference between a test community and reference conditions would be significantly greater than the normal range of variation among reference communities, and defined normal range as the confidence region enclosing 95% of reference communities. Similarly, significant impairment is defined as anything outside the 75th to 99th percentile, in many RCA-type bioassessment approaches (Table 1). When setting condition thresholds, it is widely understood that there is a trade-off between the chances of incorrectly classifying a site in reference condition when it is in fact impacted, and misclassifying a truly unimpacted site as impacted (i.e.,
Using the RCA in Bioassessments
type-two versus type-one errors in hypothesis testing; Osenberg et al. [1994], Power et al. [1995]). However, less often acknowledged is the fact that condition thresholds or bands are set arbitrarily (e.g., Norris 1995). Thus, more attention should be directed at evaluating whether condition thresholds are indicative of ecologically meaningful differences among biological communities.
How to Compare Reference and Test Sites? As a result of diverse influences, objectives and circumstances, the common approaches used to compare reference and test sites (Table 1) have each evolved with distinct advantages and shortcomings. Thus, it is not surprising that there are ongoing disputes concerning the dilemma of selecting the most appropriate data analysis for test site evaluations (e.g., Gerritsen 1995; Norris 1995; Resh et al. 2000). The primary advantage of using the multimetric approach is that current knowledge of the relationships between structure and function of aquatic ecosystems is incorporated into the metrics (e.g., Barbour and Yoder 2000). One criticism of the multimetric approach is that correlations among metrics are not accounted for, and can result in inaccurate test site evaluations (Norris 1995; Reynoldson et al. 1997). The use of correlated or redundant metrics in the overall site score
357
can result in test sites being deemed more or less impacted than they actually are. More difficult to detect however, is the problem of a test site community that is within the range typical of reference sites for more than one metric but is still different from what is typical of reference sites when the correlations are considered (e.g., Fig. 8). As a result, methods that do not account for correlations among metrics can have less power to detect true impacts, and a multivariate assessment that accounts for correlations among metrics is warranted whenever several metrics are evaluated (e.g., Norris 1995). Although a multivariate framework may be appropriate for the inherently multivariate nature of bioassessment data, there are unresolved weaknesses in each type of multivariate approach. It has been well documented that the RIVPACS and AusRivAS approaches can effectively detect impacted sites on a national scale (e.g., Wright et al. 2000) but it is widely recognized that future evolution of O/E approaches should include nonrichness metrics that reflect other structural and functional aspects of biological communities (e.g., Reynoldson and Wright 2000). Identification of the cause of stress in an O/E approach requires subjective interpretations of the difference between observed and expected taxa, and local knowledge (Wright 2000), and cannot be used to predict changes in biota as a result of Fig. 8. An example of a reference data set showing (A) mean ± standard error (dotted lines) and standard deviation (solid lines) for two biological variables, with a (B) test site and (C) discriminant axis, and the associated histograms for (D) variable one, (E) variable two, and (F) the discriminant axis scores, standardized (mean = 0, standard deviation = 1).
358
Bowman and Somers
changes in environment (Armitage 2000). In the BEAST approach, changes in community structure can be detected, but the sensitivity of confidence ellipses to reference site groupings (see the section How Many Reference Site Groups?), low power to detect impairments, and test sites that fall close to thresholds may be problematic. Similarly, all multimetric and multivariate methods discussed use subjective bands of condition to convey the degree of impairment to non-experts (Simpson and Norris 2000) and the likelihood of misclassifying test sites that fall close to these thresholds is high. Further development of multivariate methods used to compare test and reference sites should include the incorporation of metrics, significance testing (in place of banding) and better diagnostic capabilities (Norris 1995; Wright et al. 2000). Test site analysis (TSA). A new method to compare test and reference sites was developed specifically to incorporate significance testing into bioassessments but the approach also incorporates the ecological information conveyed in metrics, and a more direct means of impact diagnosis, in a multivariate framework (Bowman et al. 2003). In the TSA approach, the biological condition of a test site and comparable reference sites are evaluated using a number of summary metrics. Any combination of traditional (e.g., % chironomids), multivariate (e.g., ordination axes scores) or functional (e.g., % filterers) metrics can be used (provided that the number of metrics is not greater than the number of sites). These metrics are the raw data used in a series of multivariate calculations that can easily be performed in a simple spreadsheet (Bowman et al. 2003). The resultant overall, generalized distance (or Mahalanobis distance; e.g., Legendre and Legendre [1998]) between the test site and reference condition accounts for correlations or redundancies among metrics, and is used to statistically evaluate the condition of the test site. Non-central significance (i.e., interval and equivalence) tests are used to determine whether the biological distance between a test site and reference condition is greater than a user-specified effect size. For example, the non-central interval test provides a single probability (p) that the test site is outside the normal range (95%) of reference sites (Kilgour et al. [1998]; see the section What is an Ecologically Significant Effect?). If a test site is statistically different from reference condition (i.e., p < 0.05), additional calculations (i.e., DFA and partial DFA) are used to identify the metrics that were important in separating the test site from reference condition. Hence, the TSA approach provides an objective diagnosis of potential causes of impairment that cannot be detected by evaluating each metric or taxon individually. A test site that is not statistically different from reference condition is further categorized as ‘in reference condition’ (p > 0.95) or ‘possibly impaired’ (0.05 < p < 0.95). The TSA approach has been used to identify whether, to what
degree, and how stream benthic macroinvertebrate communities were impaired by acidification (Bowman et al. 2003). The TSA approach integrates strengths of multimetric and multivariate bioassessment approaches with hypothesis testing, can be tailored to suit varying objectives, provides unambiguous results without specialized software enabling use by experts and non-experts, and improves our ability to diagnose the cause of impairments in biological communities.
Conclusions Although the RCA is generally used in the worst-case situations and includes methodological weaknesses, there are benefits to the approach that rival the benefits of more traditional environmental study designs. There is no universal method for identifying minimally impacted reference sites but commonalities among existing guidelines provide a strong basis for site selection. Arguably the most serious weakness in the RCA is the current methodology used to group reference sites. A combination of environmental characteristics used to separate theoretical (e.g., sub-ecoregions) and empirical (e.g., biological community types) groups will likely result in more accurate reference site selection. Similarly, incorporating strengths of various methods (e.g., multimetric, multivariate) to compare test sites to the reference condition will improve the accuracy and value of test site evaluations. Goals of future research should focus on improving our ability to identify ecologically meaningful changes in biological communities, and to diagnose the cause of significant impairments. Despite the fact that the RCA introduces some uncertainty into impact assessments, the approach inherently integrates environmental characteristics that shape biological variability, and consequently results in sound, broadly applicable conclusions that advance our understanding of the environmental effects of stressors.
Acknowledgements The ideas in this paper have evolved from discussions with too many people to mention here, but we thank them all. Portions of this study were funded through collaborations between the Ontario Ministry of the Environment, the Ontario Ministry of Natural Resources, Environment Canada, Trent University, Laurentian University, the University of Toronto, the University of Waterloo, the University of Guelph and the University of Western Ontario.
References Armitage PD. 2000. The potential of RIVPACS for predicting the effects of environmental change, p. 93–112. In Wright JF, Sutcliffe DW, Furse MT (ed.), Assessing the biological quality of fresh water: RIVPACS and other techniques. Freshwater Biological Association, Cumbria, UK.
Using the RCA in Bioassessments
Bailey RC, Kennedy MG, Dervish MZ, Taylor RM. 1998. Biological assessment of freshwater ecosystems using a reference condition approach: comparing predicted and actual benthic invertebrate communities in Yukon streams. Freshwater Biol. 39:765–774. Bailey RC, Norris RH, Reynoldson TB. 2004. Bioassessment of freshwater ecosystems using the reference condition approach. Kluwer Academic Publishers, Boston. Barbour MT, Gerritsen J, Snyder BD, Stribling JB. 1999. Rapid bioassessment protocols for use in wadeable streams and rivers: periphyton, benthic invertebrates, and fish. Second edition. EPA 841-B-99-002, U.S. Environmental Protection Agency, Washington, D.C. Barbour MT, Stribling JB, Karr JR. 1995. Multimetric approach for establishing biocriteria and measuring biological condition, p. 63–77. In Davis WS, Simon TP (ed.), Biological assessment and criteria: tools for water resource planning and decision making. Lewis Publishers, Fla. Barbour MT, Yoder CO. 2000. A multimetric approach to bioassessment, as used in the United States of America, p. 281–292. In Wright JF, Sutcliffe DW, Furse MT (ed.), Assessing the biological quality of fresh water: RIVPACS and other techniques. Freshwater Biological Association, Cumbria, UK. Barrett JB, Goldsmith L. 1976. When is n sufficiently large? Amer. Stat. 30:67–70. Bowman MF, Somers KM, Reid RA. 2003. A simple method to evaluate whether a biological community has been influenced by anthropogenic activity. Can. Tech. Rep. Fish. Aquat. Sci. 2510:62–72. Chapman MG. 1999. Improved sampling designs for measuring restoration in aquatic habitats. J. Aquat. Ecosyst. Stress Recovery 6:235–251. Davies PE. 1994. Monitoring river health initiative, river bioassessment manual. National River Processes and Management Program (Freshwater Systems: Tasmania). Davies PE. 2000. Development of a national river bioassessment system (AUSRIVAS) in Australia, p. 113–124. In Wright JF, Sutcliffe DW, Furse MT (ed.), Assessing the biological quality of fresh water: RIVPACS and other techniques. Freshwater Biological Association, Cumbria, UK. Dixit SS, Smol JP, Kingston JC, Charles DF. 1992. Diatoms: powerful indicators of environmental change. Environ. Sci. Tech. 26:23–33. Gerritsen J. 1995. Additive biological indices for resource management. J. N. Am. Benthol. Soc. 14:451–457. Green RH. 1979. Sampling design and statistical methods for environmental biologists. John Wiley and Sons, New York. Hawkins CP, Norris RH, Gerritsen J, Hughes RM, Jackson S, Johnson R, Stevenson RJ. 2000. Evaluation of the use of landscape classification for the prediction of freshwater biota: synthesis and recommendations. J. N. Am. Benthol. Soc. 19:541–556.
359
Hughes RM, Larsen DP, Omernik JE. 1986. Regional reference sites: a method for assessing stream potentials. Environ. Manage. 10:629–635. Kennen JG. 1999. Relation of macroinvertebrate community impairment to catchment characteristics in New Jersey streams. J. Am. Water Resour. Assoc. 35:939–955. Kilgour BW, Somers KM, Matthews DE. 1998. Using the normal range as a criterion for ecological significance in environmental monitoring and assessment. Ecoscience 5:542–550. Legendre P, Anderson MJ. 1999. Distance-based redundancy analysis: testing multispecies responses in multifactorial ecological experiments. Ecol. Monogr. 69:1–24. Legendre P, Legendre L. 1998. Numerical ecology. Second English Edition. Elsevier, Amsterdam. McBride GB, Loftis JC, Adkins NC. 1993. What do significance tests really tell us about the environment? Environ. Manage. 17:423–432. Moss D. 2000. Evolution of statistical methods in RIVPACS, p. 25–38. In Wright JF, Sutcliffe DW, Furse MT (ed.), Assessing the biological quality of fresh water: RIVPACS and other techniques. Freshwater Biological Association, Cumbria, UK. Moss D, Furse MT, Wright JF, Armitage PD. 1987. The predictions of macroinvertebrate fauna of unpolluted running water sites in Great Britain using environmental data. Freshwater Biol. 17:41–52. Norris RH. 1995. Biological monitoring: the dilemma of data analysis. J. N. Am. Benthol. Soc. 14:440–450. Omernik JM. 1987. Ecoregions of the conterminous United States. Ann. Assoc. Amer. Geog. 77:118–125. Osenberg CW, Schmitt RJ, Holbrook SJ, Abu-Saba KE, Flegal AR. 1994. Detection of environmental impacts: natural variability, effect size, and power analysis. Ecol. Appl. 4:16–30. Paterson AM, Cumming BF, Dixit SS, Smol JP. 2002. The importance of model choice on pH inference from scaled chrysophyte assemblages in North America. J. Paleolimnol. 27:379–391. Power M, Dixon DF, Power G. 1995. Detection and decision-making in environmental effects monitoring. Environ. Manage. 19:629–639. Quinn GP, Keough MJ. 2002. Experimental design and data analysis for biologists. Cambridge University Press, Cambridge, UK. Resh VH, Rosenberg DM, Reynoldson TB. 2000. Establishing the reference conditions in the Fraser River catchment, British Columbia, Canada, using the BEAST (BEnthic Assessment of SedimenT) predictive model, p. 181–194. In Wright JF, Sutcliffe DW, Furse MT (ed.), Assessing the biological quality of fresh water: RIVPACS and other techniques. Freshwater Biological Association, Cumbria, UK. Reynoldson TB, Bailey RC, Day KE, Norris RH. 1995. Biological guidelines for freshwater sediments based on benthic assessment of sediment (the BEAST) using a
360
Bowman and Somers
multivariate approach for predicting biological state. Aust. J. Ecol. 20:198–219. Reynoldson TB, Norris RH, Resh VH, Day KE, Rosenberg DM. 1997. The reference condition: a comparison of multimetric and multivariate approaches to assess water-quality impairment using benthic macroinvertebrates. J. N. Am. Benthol. Soc. 16:833–852. Reynoldson TB, Wright JF. 2000. The reference condition: problems and solutions, p. 293–304. In Wright JF, Sutcliffe DW, Furse MT (ed.), Assessing the biological quality of fresh water: RIVPACS and other techniques. Freshwater Biological Association, Cumbria, UK. Simpson JC, Norris RH. 2000. Summarising, presenting and interpreting outputs from RIVPACS and AUSRIVAS, p. 305–310. In Wright JF, Sutcliffe DW, Furse MT (ed.), Assessing the biological quality of fresh water: RIVPACS and other techniques. Freshwater Biological Association, Cumbria, UK. Underwood AJ. 1997. Experiments in ecology. Their logical design and interpretation using analysis of variance. Cambridge University Press, Cambridge, UK. VanSickle J, Hughes RM. 2000. Classification strengths of
ecoregions, catchments, and geographic clusters for aquatic vertebrates in Oregon. J. N. Am. Benthol. Soc. 19:370–384. Verdonschot PFM. 1995. Typology of macrofaunal assemblages: a tool for the management of running waters in The Netherlands. Hydrobiologia 297:99–122. Walker IR. 1993. Paleolimnological biomonitoring using freshwater benthic macroinvertebrates, p. 306–343. In Rosenberg DM, Resh VH (ed.), Freshwater biomonitoring and benthic macroinvertebrates. Chapman and Hall, New York. Wright JF. 2000. An introduction to RIVPACS, p. 1–24. In Wright JF, Sutcliffe DW, Furse MT (ed.), Assessing the biological quality of fresh water: RIVPACS and other techniques. Freshwater Biological Association, Cumbria, UK. Wright JF, Sutcliffe DW, Furse MT (ed.). 2000. Assessing the biological quality of fresh water: RIVPACS and other techniques. Freshwater Biological Association, Cumbria, UK.
Received: February 9, 2005; accepted: June 29, 2005.