Center for Urban Environmental Studies Northeastern University, Boston, MA 02115

TECHNICAL REPORT NO. 4

SELF ORGANIZING FEATURE MAPS COMBINED WITH ECOLOGICAL ORDINATION TECHNIQUES FOR EFFECTIVE WATERSHED MANAGEMENT

Hardik Virani, MSc. Research Associate

Prof. Elias Manolakos, Ph.D. Co-Investigator

Prof. Vladimir Novotny, Ph.D. Primary Investigator

Project sponsored by Grant No. R83-0885-010 to Northeastern University from the US EPA/NSF/USDA STAR Watershed Program

Bernice Smith, EPA Project Officer

Iris Goodman, EPA Program Director

Boston, Massachusetts July 2005

Abstract

In recent years there has been increasing recognition of integrated watershed management as an approach to managing land and water resources, aimed at minimizing degradation and maintaining the overall quality of the surrounding environment. An efficient data summarization and visualization tool, besides aiding the watershed manager in formulating vital decisions, also helps to communicate and disseminate to a wider audience the physical conditions and possible approaches for watershed restoration. The aim of this work is to provide such a data analysis and visualization tool in order to help in understanding the effects of anthropogenic stress on the fish population through the fish Index of Biotic Integrity (IBI). Kohonen’s Self Organizing Map (SOM), an unsupervised neural network, is employed to pattern the sampling sites based on similar fish metric characteristics (the metrics being the attributes of the fish IBI). Clustering the SOM neurons allows us to summarize the conclusions of the analysis at the state level. Different visualizations are realized to explore the interrelationships between the environmental variables, the fish metrics and the fish IBI. Finally, Correspondence Analysis and Canonical Correspondence Analysis help us draw conclusions about specific fish species and the role of the environmental variables in maintaining a suitable habitat for fish. The methodology was applied to two large datasets from Maryland and Ohio comprising variables related to water chemistry, physical habitat and land uses, in addition to the biological data. Visualizing the fish IBI on the SOM clearly separated the regions with good IBI from the regions with poor values, validating the clustering approach employed. The high influence of habitat variables, compared to variables from other domains, confirms the growing trend in management policies that habitat should be a major consideration in developing nonpoint source strategies. Finally, the results indicate the efficiency of the SOM in visualizing the state of streams in the two states and its usefulness in helping the watershed manager make and implement decisions that will ultimately lead to restoration of the degraded watersheds.
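The patterning and clustering steps summarized in the abstract can be illustrated with a minimal sketch. It is not the implementation used in this report: the data are synthetic random numbers standing in for the fish metric vectors, and the function names (`train_som`, `best_matching_units`, `kmeans`), the learning-rate and neighborhood schedules, and the random seeds are all assumptions made for illustration. Only the 12 X 5 map size, the 3 clusters, the 100 k-means iterations, and the 955 sites with 10 metrics follow the Maryland figures quoted elsewhere in this report.

```python
import numpy as np

def train_som(data, rows=12, cols=5, n_iter=2000, lr0=0.5, sigma0=3.0, seed=0):
    """Train a rectangular SOM with online updates.

    Returns the codebook, one weight vector per neuron, shape (rows*cols, dim).
    The linear decay schedules below are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    n, dim = data.shape
    # Grid coordinates of every neuron, used by the neighborhood function.
    grid = np.array([(r, c) for r in range(rows) for c in range(cols)], dtype=float)
    # Initialize the codebook from randomly chosen samples.
    codebook = data[rng.integers(0, n, rows * cols)].astype(float)
    for t in range(n_iter):
        frac = t / n_iter
        lr = lr0 * (1.0 - frac)                  # learning rate decays toward 0
        sigma = sigma0 * (1.0 - frac) + 0.5      # neighborhood radius shrinks toward 0.5
        x = data[rng.integers(0, n)]
        # Best matching unit: neuron whose weight vector is closest to x.
        bmu = np.argmin(((codebook - x) ** 2).sum(axis=1))
        # Gaussian neighborhood around the BMU on the map grid.
        d2 = ((grid - grid[bmu]) ** 2).sum(axis=1)
        h = np.exp(-d2 / (2.0 * sigma ** 2))
        codebook += lr * h[:, None] * (x - codebook)
    return codebook

def best_matching_units(data, codebook):
    """Index of the closest neuron for every sample."""
    d = ((data[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    return d.argmin(axis=1)

def kmeans(points, k=3, n_iter=100, seed=0):
    """Plain k-means; returns the cluster label of every point."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)].copy()
    for _ in range(n_iter):
        d = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):              # leave empty clusters where they are
                centers[j] = points[labels == j].mean(axis=0)
    return labels

# Synthetic stand-in for the fish metric vectors (955 sites, 10 metrics).
rng = np.random.default_rng(42)
metrics = rng.random((955, 10))

codebook = train_som(metrics)                    # 60 neurons on a 12 x 5 grid
site_to_neuron = best_matching_units(metrics, codebook)
neuron_to_cluster = kmeans(codebook, k=3)        # cluster the neurons, not the sites
site_to_cluster = neuron_to_cluster[site_to_neuron]
```

Clustering the 60 neuron weight vectors rather than the individual sites is what lets the per-neuron averages serve as a reduced dataset for later analysis; each site simply inherits the cluster of its best matching neuron.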


Acknowledgements

The research contained in this report was sponsored by the U.S. Environmental Protection Agency/National Science Foundation/U.S. Department of Agriculture STAR Watershed Program under Grant No. R83-0885-010 to Northeastern University. The authors greatly appreciate this support. The findings and conclusions contained in this report are those of the authors and not of the funding agencies or the STAR program.


Table of Contents

Abstract
Acknowledgements
List of Figures
List of Tables
Chapter 1: Introduction
1.1. Watershed Management
1.1.1. Integrated Watershed Management
1.1.2. Objectives of this Project
1.2. The Importance of Ecological Modeling
1.2.1. Mechanistic and Statistical Modeling
1.2.2. Regression Based Methods
1.2.3. Artificial Neural Networks
1.3. Objectives and Contributions
1.4. Organization
Chapter 2: Background and Related Work
2.1. Neural Network Methods in Ecological Modeling
2.2. Kohonen Self Organizing Feature Maps
2.3. Ecological Ordination Techniques
2.3.1. Correspondence Analysis
2.3.2. Canonical Correspondence Analysis
2.4. Datasets Analyzed
2.4.1. Maryland Biological Stream Survey
2.4.2. Ohio EPA Dataset
Chapter 3: Methodology and Results
3.1. A Modeling Methodology for Large Scale Data Analysis and Visualization
3.2. Analysis of Maryland Data
3.2.1. Determining the Size of the SOM
3.2.2. The U Matrix Representation of the SOM
3.2.3. Clustering of the SOM Neurons
3.2.4. Relationships between Biological and Environmental Variables
3.2.5. Combining SOM Patterning with Correspondence Analysis
3.2.6. Canonical Correspondence Analysis
3.2.7. Discussion
Chapter 4: Analysis of the Ohio EPA Dataset
4.1. Determining the Size of the SOM
4.2. The U Matrix Representation of the SOM
4.3. Clustering of the SOM Neurons
4.4. Relationships between Biological and Environmental Variables
4.4.1. Visualization of Fish Metric Gradients on the SOM
4.4.2. Visualization of the Fish IBI
4.4.3. Visualization of Invertebrate Community Index (ICI) and Qualitative Habitat Evaluation Index (QHEI)
4.4.4. Visualization of Environmental Variable Gradients on the SOM
4.4.5. Exploring the Relationships between the Environment Variables, Fish Metrics and the Fish IBI
4.5. Combining SOM Patterning with Correspondence Analysis
4.6. Canonical Correspondence Analysis
4.8. Discussion
Chapter 5: Conclusions and Future Research
5.1. Graphical User Interface
5.2. Future Research
Bibliography
Appendices

List of Figures

Figure 1-1: A Strategic approach to integrated watershed management adopted from [ESCAP 1997].

Figure 1-2: Food web schematics. Source: [Novotny 2003]

Figure 1-3: Schematic of the multilayer risk propagation model. Source: [Novotny 2004]

Figure 2-1: Feed Forward Network with one hidden layer.

Figure 2-2: Kohonen Self Organizing Map showing the input and the output layer, modified from [Roussinov and Chen 1998].

Figure 2-3: An example of a cumulative species response to an environmental gradient.

Figure 3-1: Methodology showing how the SOM is combined with the Correspondence Analysis and Canonical Correspondence Analysis. The SOM is trained using the fish metrics vectors in order to pattern the sampling sites. The SOM neurons are then clustered using the k-means algorithm. For each SOM neuron the average (over all sampling sites patterned in that neuron) of the environmental variables and the fish species in the database are calculated. CA/CCA can then be used with the reduced datasets to provide summarized information.

Figure 3-2: Quantization and topographic errors as a function of the SOM map units. We selected 60 = 12 X 5 as the optimal number of map units.

Figure 3-3: Representation of the SOM U matrix. The inter-unit values are the Euclidean distances between adjacent map units. The levels of gray shown inside a specific unit are found by taking the median of the surrounding gray level values. The U matrix visually suggests the presence of three groups of neurons (clusters).

Figure 3-4: The Davies-Bouldin index used in conjunction with the k-means algorithm to find the optimal number of clusters present in the SOM. The index achieves a minimum (optimal) value for 3 clusters.

Figure 3-5: k-means clustering of the SOM neurons and spatial distribution of sites in each cluster for Maryland. The bottom left panel shows the result of the k-means clustering of the SOM neurons after 100 iterations, and largely agrees with the U matrix (Figure 3-3). Cluster 1 sites are concentrated in the western part of the state. The coastal area sites are mostly in Cluster 3.

Figure 3-6: Distribution of the 6 Level III Ecoregions in Maryland. The vertical axis represents the total sites in Maryland for each category. The barplots have been arranged from left to right, designated as Overall (all the 955 sampling sites in Maryland), Cluster 1, Cluster 2 and Cluster 3. The numbers in front of each legend indicate the corresponding Ecoregion number. The numerical results are tabulated in Table 3-2.

Figure 3-7: Distribution of the Stream Order in Maryland. The vertical axis represents the total sites in Maryland for each category. The barplots have been arranged from left to right, designated as Overall (all the 955 sampling sites in Maryland), Cluster 1, Cluster 2 and Cluster 3. The results are tabulated in Table 3-3.

Figure 3-8: Fish metric component planes visualized on the SOM (left) and the corresponding boxplots (right) for the individual clusters. There is a clear distribution gradient in most of the metric components. The ranges for the metrics are shown in the corresponding colorbar. The metrics abbreviations correspond to those listed in Table 3-1.

Figure 3-9: Distribution of the fish IBI on the SOM (left) and the corresponding boxplots (right) for the individual clusters. Overall indicates that the data represents all the SOM neurons which cover data from the entire state. The low values of the Fish IBI are concentrated in the top right area of the SOM (mostly in sites belonging to neurons of cluster 3).

Figure 3-10: Distribution of the fish IBI in Maryland. The averaged fish IBI values on the SOM are duplicated for the sampling sites falling in the same SOM neuron and reproduced on the map.

Figure 3-11: Comparing the actual IBI values with the SOM averaged fish IBI in Maryland. The Mean Square Error (MSE) was calculated. Each frame represents a SOM cell neuron with the label indicated at the bottom right of the frame (in steps of 5). While the scatter points in each frame represent the observed fish IBI of the sites falling in that SOM neuron, the horizontal line in the frame represents the averaged IBI value of those sites.

Figure 3-12: Distribution of the Benthic IBI on the SOM (left) and the corresponding boxplots (right) for the individual clusters. The different sizes of the neurons indicate the associated cluster for the neuron, whereby the largest (smallest) size indicates Cluster 1 (Cluster 3). Overall indicates that the data represents all the SOM neurons which cover data from the entire state. The results are similar to the fish IBI (Figure 3-9).

Figure 3-13: Distribution of the PHI on the SOM (left) and the corresponding boxplots (right) for the individual clusters. The different sizes of the neurons indicate the associated cluster for the neuron, whereby the largest (smallest) size indicates Cluster 1 (Cluster 3). Overall indicates that the data represents all the SOM neurons which cover data from the entire state.

Figure 3-14: Distribution of the Water Chemistry variables on the SOM (left) and the corresponding boxplots (right) for the individual clusters. Overall indicates that the data represents all the SOM neurons which cover data from the entire state.

Figure 3-15: Distribution of the Physical Habitat variables on the SOM (left) and the corresponding boxplots (right) for the individual clusters. Overall indicates that the data represents all the SOM neurons which cover data from the entire state.

Figure 3-16: Distribution of the Land Use (Landscape) variables on the SOM (left) and the corresponding boxplots (right) for the individual clusters. Overall indicates that the data represents all the SOM neurons which cover data from the entire state.

Figure 3-17: Correlation matrix indicating the correlation between the environmental variables and the fish metrics and indices of integrity. The gray scale indicates the absolute value of the correlation, while the sign of the correlation is indicated in the associated block.

Figure 3-18: Visualizing the relationship between the 10 Fish Metrics (top row) and the environmental variables (bottom row). Each variable and metric is plotted in the cluster where it exhibits maximal/minimal per cluster median value. Furthermore its name is listed within the neuron where it achieves the maximum/minimum in the designated cluster.

Figure 3-19: Correspondence Analysis bi-plot showing association of fish species to clusters of SOM neurons obtained via k-means clustering. The first two axes account for 28% and 15% of the variation respectively. SOM cell numbers are shown with different colors according to the cluster they belong to, as shown on the colorbar, while the fish species are indicated by the cluster ticks, as shown in the legend. The fish species names have not been included for clarity but are listed in Table 3-6.

Figure 3-20: Canonical Correspondence Analysis plot showing the environment variables scaled by a factor of 4 and the scores of the k-means clustered SOM neurons in three different shades of gray. The first two axes account for 32% and 16% of the variation respectively. The variable-neuron cluster associations are indicated according to Table 3-8. See more details in text.

Figure 3-21: Comparative ranking of the environment variables based on the length of the arrow of the environment variables in Figure 3-20. The length of the arrow represents the importance of each variable in explaining the variation in fish species distribution. Only the top 20 variables have been shown. The labels are indicated in Appendix A: Table A1.

Figure 4-1: Quantization and topographic errors as a function of the SOM map units. We selected 45 = 9 X 5 as the optimal map size.

Figure 4-2: Representation of the SOM U matrix.

Figure 4-3: The Davies-Bouldin index used in conjunction with the k-means algorithm to find the optimal number of clusters in the SOM. We selected 3 clusters.

Figure 4-4: k-means clustering of the SOM neurons and spatial distribution of sites in each cluster for Ohio. The bottom right panel shows the result of the k-means clustering of the SOM neurons after 100 iterations. The clusters are distributed across the entire state.

Figure 4-5: Distribution of the 5 Level III Ecoregions in Ohio. The y axis represents the total sites in Ohio for each category. The barplots have been arranged from left to right, designated as Overall (all the 1848 sampling sites in Ohio), Cluster 1, Cluster 2 and Cluster 3. The numbers in front of each legend indicate the corresponding Ecoregion number.

Figure 4-6: 12 Component planes for the fish metric scores visualized on the SOM (left) and the corresponding boxplots (right) for the individual cluster distributions. There is a clear gradient distribution in most of the metric components. The ranges for the metric scores are shown in the corresponding colorbar.

Figure 4-7: Distribution of the fish IBI on the SOM (left) and the corresponding boxplots (right) for the individual clusters. The different sizes of the neurons indicate the associated cluster for the neuron, whereby the largest (smallest) size indicates Cluster 1 (Cluster 3). Overall indicates that the data represents all the SOM neurons which cover data from the entire state. The low values of the Fish IBI are concentrated in the top right area of the SOM (Cluster 3).

Figure 4-8: Relationship of biological integrity to the quantitative biological criteria and the habitat uses in the Ohio Water Quality Standards. Source: [Yoder and Rankin 1998]

Figure 4-9: Distribution of the fish IBI in Ohio. The averaged fish IBI values on the SOM are duplicated for the sampling sites falling in the same SOM neuron and reproduced on the map.

Figure 4-10: Comparing the actual IBI values with the SOM averaged fish IBI in Ohio. The Mean Square Error (MSE) is calculated as 17.4142. Each frame represents a SOM cell neuron with the label indicated (in steps of 5) at the bottom right of the frame. While the scatter points in each frame represent the observed fish IBI of the sites falling in that SOM neuron, the horizontal line in the frame represents the averaged IBI value of those sites.

Figure 4-11: Distribution of the ICI on the SOM (left) and the corresponding boxplots (right) for the individual clusters. The different sizes of the neurons indicate the associated cluster for the neuron, whereby the largest (smallest) size indicates Cluster 1 (Cluster 3). Overall indicates that the data represents all the SOM neurons which cover data from the entire state.

Figure 4-12: Distribution of the QHEI on the SOM (left) and the corresponding boxplots (right) for the individual clusters. The different sizes of the neurons indicate the associated cluster for the neuron, whereby the largest (smallest) size indicates Cluster 1 (Cluster 3). Overall indicates that the data represents all the SOM neurons which cover data from the entire state.

Figure 4-13: Distribution of the Water Chemistry variables (I) on the SOM (left) and the corresponding boxplots (right) for the individual clusters. Overall indicates that the data represents all the SOM neurons which cover data from the entire state. The labels are indicated in Appendix B: Table B 1.

Figure 4-14: Distribution of the Water Chemistry variables (II) on the SOM (left) and the corresponding boxplots (right) for the individual clusters. Overall indicates that the data represents all the SOM neurons which cover data from the entire state. The labels are indicated in Appendix B: Table B 1.

Figure 4-15: Distribution of Physical Habitat and Land Use (Landscape) variables on the SOM (left) and the corresponding boxplots (right) for the individual clusters. Overall indicates that the data represents all the SOM neurons which cover data from the entire state. The labels are indicated in Appendix B: Table B 1.

Figure 4-16: Correlation matrix indicating the correlation between the environmental variables and the fish metrics and indices of integrity. The gray scale indicates the absolute value of the correlation, while the sign of the correlation is indicated in the associated block.

Figure 4-17: Visualizing the relationship between the 12 Fish Metrics (top row) and the environmental variables (bottom row). Each variable and metric is plotted in the cluster where it exhibits maximal/minimal per cluster median value. Furthermore its name is listed within the neuron where it achieves the maximum/minimum in the designated cluster.

Figure 4-18: Correspondence Analysis bi-plot showing association of fish species to clusters of SOM neurons obtained via k-means clustering. The first two axes account for 25% and 17% of the variation respectively. SOM cell numbers are shown with different colors according to the cluster they belong to, as shown on the colorbar, while the fish species are indicated by the cluster ticks, as shown in the legend. The fish species names have not been included for clarity but are listed in Table 4-5.

Figure 4-19: Canonical Correspondence Analysis plot showing the environment variables scaled by a factor of 4 and the scores of the k-means clustered SOM neurons in three different shades of gray. The first two axes account for 28% and 19% of the variation respectively. The variable-neuron cluster associations are indicated according to Table 4-7.

Figure 4-20: Comparative ranking of the environment variables based on the length of the arrow of the environment variables in Figure 4-19. The length of the arrow represents the importance of each variable in explaining the variation in fish species distribution. Only the top 20 variables have been shown. The labels are indicated in Appendix B: Table B 1.

Figure 5-1: Introductory screen prompting the user to select either of the two states used in this research.

Figure 5-2: Snapshot of the GUI for the Ohio EPA dataset, indicating the methodology.


List of Tables

Table 3-1: Metrics used to compute the Fish IBI in the MBSS [Roth et al. 2000].

Table 3-2: Distribution of ecoregions in each cluster in Maryland. The percentages indicate the proportion of the sites in each ecoregion for each cluster.

Table 3-3: Distribution of the stream order in each cluster in Maryland. The percentages indicate the proportion of the sites in each cluster for a particular stream order.

Table 3-4: Table listing the fish metrics in the cluster column in which their per cluster median value is maximized. The labels for the metrics are given in Table 3-1.

Table 3-5: Table indicating the environmental variables in the cluster column in which their per cluster median value is maximized. The labels for the environment variables are listed in Appendix A: Table A1.

Table 3-6: Fish species associated with clusters. Row wise arrangement of species in each cluster section corresponds to traversing the corresponding ticks from right to left, top to bottom within the cluster area indicated in Figure 3-19.

Table 3-7: Percentage of fish species having a certain ecological characteristic (e.g. NATIVE) thus contributing to the corresponding metric (e.g. NUMNATIVE) computed after CA and clustering of fish species (using Table 3-6 only). Percentages are maximized in the same cluster (shown by bold font) in which the per cluster median of the corresponding metric is maximized (Table 3-4).

Table 3-8: Association of environmental variables with k-means based neuron clusters. Each variable is listed in the column corresponding to the cluster with the largest median projection on that variable’s line. Bold text highlights variables for which both methods (SOM and the SOM combined with CCA) confirm a strong association with a cluster. For the other variables, the numbers in parentheses indicate the SOM assigned cluster association (Table 3-5). Furthermore, if a variable achieves maximum per cluster median value in a border neuron, the neighboring cluster number is also noted.

Table 4-1: Metrics used to compute the Fish IBI in the Ohio EPA [OhioEPA 1987].

Table 4-2: Distribution of ecoregions in each cluster in Ohio. The percentages indicate the proportion of the sites in each ecoregion for each cluster.

Table 4-3: The fish metrics are listed in the cluster column in which their per cluster median value is maximized.

Table 4-4: The environmental variables listed under the cluster column in which their per cluster median value is maximized. The labels are indicated in Appendix B: Table B 1. The results have been summarized from Appendix B: Table B 5.

Table 4-5: Fish species associated with SOM neuron clusters. Row wise arrangement of species in each cluster section corresponds to traversing the corresponding ticks from right to left, top to bottom in Figure 4-18.

Table 4-6: Percentage of fish species having a certain ecological characteristic (Appendix B: Table B 3) contributing to one of the 12 fish Metrics, computed after CA and clustering of the fish species based on Table 4-5.

Table 4-7: Association of environmental variables with k-means based SOM neuron clusters. Each variable is listed in the column corresponding to the cluster with the largest median projection on that variable’s line. Bold text highlights variables for which both methods (SOM alone and the SOM combined with CCA) confirm association with a cluster. For the other variables, the numbers in parentheses indicate the SOM assigned cluster association (Table 4-4). Furthermore, if a variable achieves maximum per cluster median value in a border neuron, the neighboring cluster number is also noted.


Chapter 1: Introduction

1.1. Watershed Management

A major goal of the Clean Water Act, passed by the US Congress in 1972, is restoring and maintaining the chemical, physical, and biological integrity of the Nation’s waters. In its “National Water Quality Inventory: 1998 Report to Congress” [USEPA 2000], EPA concludes that 40 percent of the nation's assessed waterways remain too polluted for fishing and swimming. A major contribution of the Clean Water Act was the requirement that the states establish TMDLs (Total Maximum Daily Loads) on impaired water bodies within their jurisdiction, aimed at reducing both point source pollution (from a specific source) and nonpoint source pollution (e.g. stormwater runoff). The goal of preserving environmental quality while sustaining economic prosperity will require management approaches that integrate human and natural systems, rather than isolated activities involving construction of additional control works, more regulations, or more money.

1.1.1. Integrated Watershed Management

In recent years there has been increasing recognition of the need for a new holistic and pragmatic approach to the management of land and water resources, aimed at minimizing degradation and maintaining the overall quality of the surrounding environment. Such an approach, called integrated watershed management, involves the implementation of a coherent land and water management system which can ameliorate the adverse impacts of human activities or natural disasters. The goal of integrated watershed management is to adopt a systems approach to plan and work toward an environmentally and economically viable healthy watershed, taking into account the complex interrelationships between the parameters of interest. To achieve these objectives, integrated watershed management should revolve around the following tasks, as highlighted in Figure 1-1:

1. Mutual coordination of policies, programs and activities between the various agencies.
2. Encouraging community participation in the management process.
3. Identification and successful rehabilitation of natural resource degradation.
4. Ensuring sustainable use of natural resources.
5. Provision of high quality water and protective vegetative cover within individual watersheds.

In view of the above approach, the specific issues before watershed managers include ways to conserve wildlife habitat, maintain water quality, and ensure soil stability, while also complying with government regulations. Watershed managers need to be able to make a local assessment of multiple stressor effects on the ecological vulnerability of the water bodies, point out those stresses that have the largest impact and, subsequently, propose and develop a cost-effective practical remedy. In addition to the traditional challenges of meeting water quantity and quality regulations and flood mitigation requirements, today's hydrologic engineers and watershed managers must also account for sensitive and exotic species, the arrival of new water-borne diseases such as the West Nile virus, and increasingly complex regulations.

Figure 1-1: A Strategic approach to integrated watershed management adopted from [ESCAP 1997].

Visualization plays an important role in watershed management. An efficient visualization software tool, besides aiding the watershed manager in formulating vital decisions, also helps to communicate and disseminate to a wider audience the physical conditions and approaches for watershed restoration. Simulation models are major components in the fields of water resources assessment, development, and management. They are widely accepted within the water resources community and are usually designed to predict the response of a system under a particular set of conditions. SSARR (Streamflow Synthesis And Reservoir Regulation, U.S. Army Corps of Engineers, North Pacific Division) [SSARR 2005], HEC-RAS (River Analysis System, Hydrologic Engineering Center) [RAS 2005], QUAL (stream water quality model, Environmental Protection Agency), HEC5 (simulation of flood control and conservation systems, Hydrologic Engineering Center) [HEC 2005], SUTRA (Saturated-Unsaturated TRAnsport model, US Geological Survey) [SUTRA 2005], and KYPIPE (pipe network analysis, University of Kentucky) [KYPIPE 2005] were some of the earlier simulation models, developed primarily in the FORTRAN language [Simonovic 2000].

The explosive growth of information technology has given a major boost to the development and implementation of managerial strategies in watershed management and restoration. Geographic information systems (GIS) are becoming important components of simulation models and decision systems, increasingly being used to store both georeferenced data and associated attributes. GIS technology has played critical roles in all aspects of watershed management, from assessing watershed conditions through modeling impacts of human activities on water quality to visualizing impacts of alternative management scenarios. GIBSI [Rousseau et al. 2000] is one such integrated modelling system, comprising physically-based simulation models (hydrological, soil erosion, agricultural-chemical transport and water quality), a relational database management system and a GIS.

Novel techniques such as fuzzy logic, artificial neural networks and genetic algorithms are increasingly tested against ecological data [Recknagel 2001]. An advantage of these data-driven approaches is that they select the critical model inputs themselves. The Land Transformation Model (LTM) employs GIS and Artificial Neural Networks (ANNs) to describe the influence of landscape changes on ecosystem integrity of large areas [Pijanowski et al. 2000]. The LTM currently employs a multilayer perceptron (MLP) neural net topology with one or two hidden layers; each layer has at least the same number of nodes as the number of input vectors. ANNs are used to learn the patterns of development in the region and test the predictive capacity of the model, while GIS is used to develop the spatial predictor drivers and perform spatial analysis on the results [Pijanowski et al. 2002]. New tools such as global positioning systems (GPS) and remote sensing are also being developed to inventory and monitor watershed characteristics [Guertin et al. 2000].
Tools developed in the ArcPad environment attempt to combine GIS, mobile computing systems, satellite and aerial images, and network interconnection within the framework of standard ecological methodology [Matejicek 2003]. All these tools provide a better understanding of how ecological systems are managed at various levels of ecological research.

1.1.2. Objectives of this Project Traditionally, river water quality assessment has been based on chemical analysis. However, it seems natural that biological assessment should be included in any determination of quality, for the biotic community constitutes a key component of the environment. Biological monitoring has thus become established as an integral part of water quality monitoring. Overseeing the biological resources in the ecological system is one of the important tasks in watershed management and restoration. Biological communities respond to and integrate a wide variety of chemical, physical and biological factors in the environment, whether of natural or anthropogenic origin. Biosurveys can therefore indicate potential variations in water quality. It is worth noting that fish occupy levels IV and V of the aquatic food web, just below shellfish-eating fowl and humans (Figure 1-2), and are consumed by humans, which makes them important for assessing contamination. Fish are good indicators of long-term effects and also indicate broad habitat conditions because they are relatively long-lived and mobile. Most fish move away when they detect pollution or habitat degradation in the water environment, but their eggs and fry cannot escape.

Figure 1-2: Food web schematics. Source: [Novotny 2003]

The fish survey protocol in the present research is based largely on Karr’s Index of Biotic Integrity (IBI) with some modifications for regional conditions [Karr 1991]. Under this technique, multiple attributes of the resident fish assemblage are employed to evaluate the health and integrity of a stream. In particular, the fish IBI examines four components of the fish community (abundance, diversity, trophic interactions, and fish health), expressed in terms of fish metrics, to calculate a numerical value.

Figure 1-3: Schematic of the multilayer risk propagation model. Source: [Novotny 2004]


Quantifying the relationships between the stressors, water quality, sediment and habitat risks and the biological endpoints (IBIs) is one of the core aims of our STAR Watershed Project [STAR 2005]. We envision a hierarchical watershed-scale model, aimed at identifying the factors that affect the integrity of the receiving water bodies. The proposed layout of the model is shown in Figure 1-3. In this concept each individual metric of the IBI is related to a number of stressors and risks, including the risk (or index) of habitat impairment (metric by metric), water quality/pollution risk (parameter by parameter) and sediment contamination risk (parameter by parameter). In [Novotny et al. 2005], it was hypothesized that the relationship progresses in layers from multiple landscape and pollution allochthonous stressors (lowest layer) to in-stream effects expressed, for example, by time series of concentrations, flow and temperature variability, or habitat disruption and fragmentation (layer 2), risk probabilities (layer 3), and finally fish and benthic macroinvertebrate composition expressed by IBIs (top layer). This approach emphasizes the identification of the root stressors and the progression of risk from the root stressors to the biotic endpoints.

1.2. The Importance of Ecological Modeling The complexity and diversity of ecological systems result in complex spatial and temporal relations between the biological and abiotic variables, justifying the use of multiple modeling techniques. Models provide an opportunity to explore ideas regarding ecological systems that may not be possible to field-test for logistical, political, or financial reasons [Jackson et al. 2000]. A good conceptual model should be simple in concept and implementation and yet be able to capture the details of the underlying ecological structure. The basic requirements in model development include hypothesis formulation, determination of available and required data, and an in-depth assessment of key components of the system. Models based on different statistical and simulation techniques are designed to corroborate, analyze and draw conclusions about community dynamics with respect to the surrounding environment. Practitioners in the field of environmental engineering rely heavily on mathematical models to better understand the effects of natural and anthropogenic stressors on ecological systems. Ecological data used in these techniques are non-linear and complex. In most situations the collection of field data is both time-consuming and expensive. This poses serious obstacles to the development of predictive systems modeling for environmental engineering applications.

1.2.1. Mechanistic and Statistical Modeling Mechanistic and statistical modeling represent two different, yet complementary, approaches to the modeling of systems [Kendall et al. 1999;Tewari et al. 2001]. While mechanistic models follow the traditional physics approach by focusing on describing the underlying ecological mechanisms, statistical or empirical modeling aims at understanding the system via data summarization. A mechanistic model is a representation of the physical, biological, or mechanistic theory governing the system; in contrast, a statistical model is obtained by statistically fitting equations to the available data. Although mechanistic models offer flexibility in implementation and a true fundamental understanding of various nonlinear systems, lack of mechanistic understanding and modeling expertise are major deterrents to the choice of mechanistic modeling. Under such circumstances, statistical modeling provides a means to quantify the system data and identify trends that may suggest further monitoring activities.

1.2.2. Regression Based Methods Regression allows us to model the dependence of (usually) one response variable on one or more predictor variables. The simple models are based on a stepwise progression of univariate and multivariate analysis. Simple linear regression involves discovering the equation for a line that most nearly fits the given data. These models are easier to test in replication and cross-validation studies. Furthermore, they are less costly to put into practice in predicting and controlling the outcome in the future. Regression analysis has traditionally been employed in models with the aim of exploring the relationships between environment variables and fish biota. In fact, [Fausch et al. 1988] reviewed 95 ecological models between 1950 and 1985 and noted 79 models with linear or multiple regression.
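As a minimal sketch of the simple linear regression described above, the following Python fragment fits a line by ordinary least squares. The variable names and the numbers (a habitat score as predictor and a fish metric as response) are hypothetical and serve only as an illustration; they are not taken from the datasets used in this report.

```python
import numpy as np

def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = x.mean(), y.mean()
    # Slope = covariance(x, y) / variance(x); intercept follows from the means.
    slope = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
    intercept = y_mean - slope * x_mean
    return intercept, slope

# Hypothetical data: one predictor (habitat score) and one response (fish metric).
habitat = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
metric = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

intercept, slope = fit_line(habitat, metric)
predicted = intercept + slope * habitat   # fitted values on the line
```

Multiple regression extends the same idea to several predictors by solving the corresponding least squares problem for a coefficient vector.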

1.2.3. Artificial Neural Networks Artificial Neural Networks (ANNs) [Bishop 1995;Haykin 1999] are versatile tools inspired by the problem-solving process of the human brain, used to extract information from complex data sets and apply it effectively for association and classification purposes. ANNs have advantages over more traditional methods when modeling poorly defined and understood systems and in situations where input data are incomplete or ambiguous by nature. Unlike more commonly used statistical methods, neural networks are not dependent on particular functional relationships, make no assumptions regarding the distributional properties of the input data, and require no a priori understanding of variable relationships. This independence makes ANNs a potentially powerful family of tools for exploring complex, nonlinear biological problems [Lek and Guégan 2000], such as capturing the relationships believed to exist between fish and their surrounding environment. One of the most important aspects of neural networks is the learning process. Learning can be done in a supervised or an unsupervised manner.

1.2.3.1. Supervised Learning In supervised training, both the inputs and the desired outputs are provided. The inputs are processed by the network and the resultant outputs are compared against the desired outputs. The resulting error is used to adjust the weights of the network. This process is repeated many times as the weights are continually tuned to represent the data as closely as possible. Different network structures, such as the Multi-layer Perceptron and the Radial Basis Network, can be employed to characterize this mapping using input-output examples and supervised training algorithms, such as the well known Back Propagation algorithm [Haykin 1999].


1.2.3.2. Unsupervised Learning In unsupervised learning, the network is provided with inputs but not with desired outputs. The system itself must then decide what features it will use to group the input data. This is often referred to as self-organization. This concept is employed to discover significant patterns, or features, in the input data, a task also known as data mining. The input patterning problem revolves around dimensionality reduction, in which input vectors are grouped into clusters through repeated presentations. By analyzing the emerging relationships between the various input variables, a good representation of the system can be obtained. Kohonen’s Self-Organizing Feature Maps (SOM) [Kohonen 2001] and Adaptive Resonance Theory (ART) [Carpenter and Grossberg 1988] are well-known neural network structures trained with unsupervised learning rules that are widely applied in data analysis.

There are a number of advantages in using ANN technology for modelling applications in environmental engineering and science. In general, ANNs can be used to model multivariate data sets with variables that are continuous or discrete, have linear or nonlinear responses, and that vary independently of each other. Artificial neural network applications can be developed and deployed quickly and easily with very little programming, owing to the existence of a number of user-friendly ANN software packages and continual research into model protocol development. ANN based modeling is now considered a mature subfield of applied statistics, and many commercially available data analysis software packages, such as MATLAB [Matlab 2004], SPSS [SPSS 2004] and others, offer neural network add-on modules that can be used in a variety of application domains. Finally, ANN technology is inherently robust and fault-tolerant because of the parallel computational structure and distributed memory of developed models.
As with any modelling technique, there are challenges associated with the use of ANNs that potential users must be aware of. First, the ANN technology is data reliant and requires a sufficient quantity of representative data to capture key variable relationships efficiently. From a computational point of view, ANN applications are usually very intensive and the training phase can be time consuming depending on the network structure. Since knowledge regarding the relationships modeled by ANNs is generally limited, selecting the appropriate inputs is also one of the most difficult tasks in the development of an ANN based model. Recent studies [Gevrey et al. 2003;Kingston et al. 2004;Olden and Jackson 2002;Olden et al. 2004] have tried to remove the “black box” label attached to ANNs, assess the statistical significance of potential inputs and reduce the dimensionality of the problem. Research into ANNs has led to the development of various types of neural networks, capable of solving different kinds of problems: auto-associative memory, generalization, optimization, data reduction, control and prediction tasks under a wide array of network architectures and problem scenarios. Increasing publications of successful novel neural networks has increased the applicability of ANN technology in the ecological domain. The STAR project attempts to utilize the ANN as a powerful tool, to interpret the interaction of the biological communities with the environment and explore the fascinating diversity of these communities.


1.3. Objectives and Contributions It is important for the watershed manager to have an integrated tool that supports quantitative analysis and visualization of the inherent dynamics of the community structure in the watershed. The proposed methodology strives to unravel complex nonlinear ecological relationships and to assist the watershed manager in detecting emerging trends, performing “what if” investigations to assess the possibility of different scenarios, and weighing the various factors and options at hand as needed to formulate a cost-effective action plan. While simplified interpretations may be beneficial, the relationship between the biological community and the environment is highly complex and non-linear. It thus remains a challenge to ensure that the information lost through data analysis is minimal. The aim of this work is to provide the watershed manager with an efficient data analysis and visualization tool that helps in understanding the effects of anthropogenic stress on the fish population through the fish IBI and supports an in-depth evaluation and diagnosis of the problems plaguing the watersheds. The proposed approach patterns the sampling sites based on the fish metrics through Kohonen’s Self Organizing Map (SOM). The SOM, an unsupervised neural network algorithm, is employed to display high-dimensional data in a 2-dimensional space. One of the major strengths of the SOM, put to good use in this research, is its ability to produce effective and ordered visualizations of highly complex multidimensional datasets. This approach allows us to cluster sites based on the metrics, whose combined effect is expressed in the fish IBI. Each cluster is a collection of SOM neurons, and each neuron in turn collects the sampling sites with similar metric characteristics. As a result, all the sampling sites with a poor fish IBI get lumped together, which enables further segregation of the factors affecting the fish IBI in terms of the clusters.
Further analysis by Correspondence Analysis (CA) and Canonical Correspondence Analysis (CCA) allows us to quantify and visualize the relationships believed to exist between the involved parameters. The methodology was applied to ecological data sampled from Maryland and Ohio. The report attempts to offer the watershed manager a parallel visualization of all the environmental variables in relation to the three strata of the fish community: the species level, the metrics and the fish IBI.

1.4. Organization Chapter 1 highlighted the purpose and the scope of the report. In Chapter 2, we describe in detail the various techniques used in the methodology. This chapter focuses primarily on the Kohonen Self Organizing Map, Correspondence Analysis and Canonical Correspondence Analysis, with the associated literature review. We conclude the chapter with a brief description of the two datasets used in the report: the Maryland Biological Stream Survey (MBSS) dataset and the Ohio EPA dataset. Chapter 3 is divided into two parts, one introducing the conceptual methodology and the other presenting the complete analysis of the MBSS dataset, with the results and the associated discussion. In Chapter 4, the same methodology is applied to the Ohio EPA dataset, which allows us to compare the results obtained for the two datasets. Finally, in Chapter 5, we summarize our findings and give an overview of future research that will extend the scope of this technical report.


Chapter 2: Background and Related Work In this chapter, we provide a preview of the different techniques used in the research with the associated literature review. The advantages of using Artificial Neural Networks over regression techniques in ecology are emphasized. A comprehensive understanding of an unsupervised Neural Network - Kohonen Self Organizing Map and Ecological Ordination techniques - Correspondence Analysis and Canonical Correspondence Analysis is provided. We conclude the chapter with the specifics of the Maryland Biological Stream Survey dataset and the Ohio EPA dataset, which are employed in the methodology.

2.1. Neural Network Methods in Ecological Modeling Regression analysis is a simple yet reliable and widely applied statistical method, used frequently in predictive modeling. Multiple regression allows the simultaneous modelling and testing of multiple independent variables. Regression of a biological condition on chemical and/or physical stressors is normally employed to provide a more refined analysis of the interrelationships in the environment [Feck and Hall 2004;Oberdorff et al. 2001;Reash and Pigg 1990;Rogers et al. 2002;Wiley et al. 2004]. A multiple regression model was used by [Dow and Zampella 2000] to relate the percentage of altered land to pH and specific conductance, to assess their joint use as indicators of watershed disturbance. [Comeleo et al. 1996] analyzed relationships between first principal components for sediment metals and organics concentrations and watershed stressor variables for 25 Chesapeake Bay sub-estuaries using rank correlation and stepwise multiple regression techniques. [Çamdevýren et al. 2005] similarly used principal component scores in multiple linear regression analysis to predict Chlorophyll-a levels from chemical, physical and biological water quality variables in the Çamlidere reservoir in Turkey. Neural network applications to ecological modeling or, more generally, to ecology are quite recent. In fact, the first references to the potential use of neural networks in ecological modeling appeared in the early 1990s. [Lek et al. 1996] and [Olden and Jackson 2001] have shown that an ANN may work better than multiple regression techniques in applications involving non-linear complex relationships between a large number of variables. [Lek et al. 1996] applied both techniques to a set of ecological data involving the study of the relationship between density of brown trout spawning sites (redds) and habitat characteristics, and concluded that ANNs provide a powerful predictive alternative to the traditional multiple regression techniques.
[Olden and Jackson 2001] similarly used simulated and empirical examples and showed that ANNs exhibit greater predictive power than traditional regression approaches for modeling species occurrence and abundance. Similar results were demonstrated in [Baxter et al. 2004], where it was shown that ANN models are better suited than multiple regression approaches to model the filter effluent particle counts and filter effluent turbidity from raw water quality variables in the drinking water treatment industry. A large number of authors [Manel et al. 1999;Ozesmi and Ozesmi 1999;Paruelo and Tomasel 1997] have emphasized the interest of using ANNs instead of linear statistical models.


Implementation of ANNs has been tried in several different fields of applied ecology. Examples include predicting brown trout density [Lek et al. 1996], predicting rainfall runoff [Anctil and Tape 2004], modeling trihalomethane residuals in water treatment [Lewin et al. 2004] and modeling of the oil extraction process from oil sands [Zhang et al. 2000]. One hidden-layer feed forward neural networks (Figure 2-1) trained by the well known Backpropagation (BP) algorithm [Haykin 1999] have been used in a wide spectrum of ecological applications. Some of the application domains for the BP algorithm are daily river flow forecasting [Danh et al. 1999], weekly nitrate-nitrogen (nitrate-N) concentration predictions [Momcilo et al. 2003], fish community composition prediction [Scardi et al. 2000] etc.

Figure 2-1: Feed Forward Network with one hidden layer.

The feed forward network is considered to be a universal approximator of any continuous function. A single hidden layer is preferred because it is generally sufficient for most applications; it greatly reduces computational time and produces similar results compared with multiple hidden layers.
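To make the single-hidden-layer architecture concrete, the sketch below trains a small feed forward network by gradient descent with backpropagated errors. It is an illustration only: the toy data (a sine curve), the hidden layer size, the tanh activation and the learning rate are all assumptions chosen for brevity and are not settings used in this report.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: one input variable and one continuous target.
X = np.linspace(-2.0, 2.0, 40).reshape(-1, 1)
Y = np.sin(X)

n_hidden = 8                                  # assumed hidden layer size
W1 = rng.normal(0.0, 0.5, (1, n_hidden))      # input -> hidden weights
b1 = np.zeros(n_hidden)
W2 = rng.normal(0.0, 0.5, (n_hidden, 1))      # hidden -> output weights
b2 = np.zeros(1)

def forward(inputs):
    hidden = np.tanh(inputs @ W1 + b1)        # hidden layer activations
    return hidden, hidden @ W2 + b2           # linear output layer

def mse(pred, target):
    return float(np.mean((pred - target) ** 2))

initial_loss = mse(forward(X)[1], Y)

lr = 0.05                                     # assumed learning rate
for _ in range(1000):
    H, pred = forward(X)
    err = (pred - Y) / len(X)                 # gradient of half the MSE w.r.t. output
    grad_W2 = H.T @ err
    grad_b2 = err.sum(axis=0)
    dZ = (err @ W2.T) * (1.0 - H ** 2)        # error propagated back through tanh
    grad_W1 = X.T @ dZ
    grad_b1 = dZ.sum(axis=0)
    W1 -= lr * grad_W1
    b1 -= lr * grad_b1
    W2 -= lr * grad_W2
    b2 -= lr * grad_b2

final_loss = mse(forward(X)[1], Y)
```

Comparing `final_loss` with `initial_loss` shows how far the repeated weight adjustments have reduced the error on the training examples.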

2.2. Kohonen Self Organizing Feature Maps The Self-Organizing Map (SOM), invented by Professor Kohonen [Kohonen 1990], is one of the most popular neural network structures based on competitive learning. It implements a nonlinear projection from the high-dimensional input space onto a low-dimensional network of neurons (usually a 2-dimensional grid) in an orderly manner. The SOM is based on unsupervised learning, which means that no “teaching output” is needed during the learning and that no assumptions are made about the distribution of the input data. The SOM can be used to extract features inherent to the data analysis problem and thus has also been called the Self-Organizing Feature Map.


Figure 2-2: Kohonen Self Organizing Map showing the input and the output layer, modified from [Roussinov and Chen 1998]

The SOM consists of the input (data) layer and the output (map) layer. In the network each neuron of the input layer represents an input variable and has a weighted connection to each node of the output layer (Figure 2-2). The connection weights change adaptively at each iteration of the training algorithm, whose steps are summarized below:

1. Initialize the weights, i.e. the codebook vectors $W_j = \{w_{ij};\ i = 1, 2, \ldots, N\}$ of each neuron $j$ in the SOM, to random values in the interval [0, 1].
2. The input for the Kohonen SOM is a dataset of vectors $\{X(t)\}$, each one consisting of $N$ variables, $X(t) = [x_1(t), x_2(t), \ldots, x_N(t)]$.
3. At presentation $t$, the input vector $X(t)$ is compared with all the SOM neuron weights using some appropriate distance metric (e.g. the Euclidean distance) as shown below:

$$d_j(t) = \sum_{i=1}^{N} \left( x_i(t) - w_{ij}(t) \right)^2 \qquad \text{(2-1)}$$

4. The neuron with the shortest distance to the input vector is declared the winning neuron, also called the Best Matching Unit (BMU).
5. The weights of the BMU and its neighboring neurons are then updated to further reduce the distance between them and the input vector, as shown below. This has the effect of increasing the similarity between the presented data vector and the weights of the neighboring neurons:

$$w_{ij}(t+1) = w_{ij}(t) + \eta(t)\, N(t, r)\, \left( x_i(t) - w_{ij}(t) \right) \qquad \text{(2-2)}$$

where $\eta(t)$ denotes the fractional increment of the correction and $N(t, r)$ is a time-varying neighborhood function which determines the radius $r$ from the BMU in the SOM, i.e. the extent of the neurons which should be updated. The radius is set to a larger value early in the learning process to include a larger number of neurons in the update and is gradually reduced as training converges.
6. Repeat steps 2-5 until convergence.

After several presentations of all input vectors (epochs), the end result of this unsupervised training algorithm is that the neurons on the low-dimensional grid (output layer) become ordered, i.e. neighboring neurons have similar weight vectors.
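The training steps above can be sketched in a few lines of Python. This is an illustrative sketch rather than the implementation used in this report: the grid size, the linear decay of the learning rate and radius, the Gaussian form of the neighborhood function and the synthetic two-cluster data are all assumptions.

```python
import numpy as np

def train_som(data, rows=4, cols=4, epochs=20, eta0=0.5, radius0=2.0, seed=0):
    """Train a small rectangular SOM; returns weights of shape (rows*cols, N)."""
    rng = np.random.default_rng(seed)
    n_samples, n_vars = data.shape
    # Step 1: random codebook vectors in [0, 1].
    weights = rng.random((rows * cols, n_vars))
    # Grid position of each neuron, needed by the neighborhood function.
    grid = np.array([(r, c) for r in range(rows) for c in range(cols)], dtype=float)
    n_steps = epochs * n_samples
    t = 0
    for _ in range(epochs):
        for idx in rng.permutation(n_samples):
            x = data[idx]                                # Step 2: present one vector
            frac = t / n_steps
            eta = eta0 * (1.0 - frac)                    # assumed linear decay
            radius = max(radius0 * (1.0 - frac), 0.3)    # assumed linear shrink
            d = np.sum((x - weights) ** 2, axis=1)       # Step 3: Eq. (2-1)
            bmu = int(np.argmin(d))                      # Step 4: best matching unit
            grid_d2 = np.sum((grid - grid[bmu]) ** 2, axis=1)
            h = np.exp(-grid_d2 / (2.0 * radius ** 2))   # assumed Gaussian neighborhood
            weights += eta * h[:, None] * (x - weights)  # Step 5: Eq. (2-2)
            t += 1
    return weights

def quantization_error(data, weights):
    """Mean Euclidean distance from each input vector to its BMU."""
    d = np.linalg.norm(data[:, None, :] - weights[None, :, :], axis=2)
    return float(d.min(axis=1).mean())

# Hypothetical data: two groups of sites described by three scaled variables.
rng = np.random.default_rng(1)
sites = np.clip(np.vstack([rng.normal(0.2, 0.03, (30, 3)),
                           rng.normal(0.8, 0.03, (30, 3))]), 0.0, 1.0)
trained = train_som(sites)
untrained = train_som(sites, epochs=0)   # same seed, so the same initial weights
```

The quantization error gives a simple check that training has moved the codebook vectors toward the data; it says nothing about how well the map preserves neighborhood relations.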


Like a flexible net, the SOM learns a representation of the distribution of the input data, folding onto the “cloud” formed by the input data during the iterative training. The training is usually performed in two phases. A relatively large initial learning rate and neighborhood radius are used in the first phase, to tune the SOM approximately to the same space as the input data. In the second phase both the learning rate and the neighborhood radius are small right from the beginning, to achieve further fine-tuning of the SOM. Since one of the goals of using an SOM is to encode a large set of input vectors by finding a smaller set of prototype vectors that provide a good approximation to the original input vector distribution, the SOM is also considered a vector quantization method, where the neuron weights play the role of the codebook vectors [Luttrell 1994]. Since the SOM approximates the probability density function of the input data, it is widely applied for exploratory data analysis or data mining. The SOM’s output emphasizes the salient features of the data and subsequently leads to the automatic formation of clusters of similar data items. The data abstraction, along with the enhanced visualization features, justifies the use of the SOM as a tool for data representation. Among the numerous applications of the SOM, we mention here a biometric iris-based system [Lye et al. 2002], recognition of topographic patterns in EEG spectra [Joutsiniemi et al. 1995], reliability analysis in power systems [Luo et al. 2000], and classification of documents [Hijikata et al. 1997]. A detailed list of practical applications involving the SOM is provided in [Kohonen et al. 1996]. So far, SOMs have rarely been used in ecology, though successful results have been documented. [Brosse et al. 2001] compared SOM with PCA when analyzing the spatial occupancy of several European freshwater fish species.
[Giraudel and Lek 2001] compared the SOM with various ordination techniques and showed that SOMs enable the visualization of the sampling units as well as of the distribution of species abundance data. Hybrid networks involving neural networks are currently being tested against ecological data. An example of hybridization, using the SOM for clustering the input vectors and the back propagation algorithm for predicting the biotic attributes, is given in [Park et al. 2003] and [Gevrey et al. 2004]. [Park et al. 2004] use Adaptive Resonance Theory (ART), another self-organizing network, in a second step to find clusters in the neuron units of the SOM. Different levels of dissimilarity threshold in the ART were used for clustering at different scales. [Park et al. 2003] implemented a Counterpropagation Neural Network to pattern sampling sites and to predict the Species Richness and Shannon Diversity Index from the environmental variables. Many of the studies listed above were carried out within the “Predicting Aquatic Ecosystems Quality using Artificial Neural Networks” (PAEQANN) project, supported by the European Commission. PAEQANN advocates the use of ANNs as predictive tools to define effective policies and improve freshwater management and to potentially identify problems in an ecosystem’s functioning for a future restoration of its integrity [Paeqann 2004].

2.3. Ecological Ordination Techniques The main purposes of ordination in ecology are summarizing bulky data and interpreting the observed community structure in terms of the environmental variables. In most instances, ecologists are interested in characterizing the main trends of variation in the data. This involves projecting the multidimensional scatter diagram onto bivariate plots where the axes are chosen to represent a large fraction of the variability of the data in a reduced space. Ordination approaches have been both developed within ecology and borrowed from other disciplines. The graphical results from most techniques often lead to ready and intuitive interpretations of species-environment relationships. Polar ordination [Bray and Curtis 1957] was the first widely-used ordination technique in ecology. Indirect gradient analysis utilizes only the species information to perform the ordination. Principal Component Analysis (PCA) [Legendre and Legendre 1998] and Correspondence Analysis (CA) [Hill 1974] are well known and widely-used ordination techniques, which use indirect gradient analysis in order to summarize the structure of ecological communities. Direct gradient analysis, in contrast, utilizes external environmental data in addition to the species data. In its simplest form, direct gradient analysis is a regression technique. The plots display species or community abundance in response to a known environmental gradient. Direct analysis allows us to test the null hypothesis that species composition is unrelated to measured variables. Redundancy Analysis (RDA) [Legendre and Legendre 1998] and Canonical Correspondence Analysis (CCA) [Ter Braak 1986] are the two most commonly used constrained ordination techniques in ecology.
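As a small illustration of projecting multidimensional data onto two axes that capture a large fraction of the variability, the sketch below computes PCA scores with a singular value decomposition. The random sites-by-variables matrix and the function name are hypothetical; the fragment only illustrates the idea of an indirect ordination and is not the procedure used in this report.

```python
import numpy as np

def pca_scores(data, n_axes=2):
    """Project centered data onto its first principal axes via SVD."""
    centered = data - data.mean(axis=0)
    U, s, Vt = np.linalg.svd(centered, full_matrices=False)
    scores = U[:, :n_axes] * s[:n_axes]        # site coordinates on the axes
    explained = s ** 2 / np.sum(s ** 2)        # fraction of variance per axis
    return scores, explained[:n_axes]

# Hypothetical matrix: 20 sites described by 5 variables of unequal spread.
rng = np.random.default_rng(0)
sites = rng.normal(size=(20, 5)) * np.array([3.0, 2.0, 1.0, 0.5, 0.2])

scores, explained = pca_scores(sites)
```

Plotting the two columns of `scores` against each other gives the bivariate ordination diagram, and `explained` reports how much of the total variance each axis represents.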

2.3.1. Correspondence Analysis From a theoretical point of view, the behavior of a species along a gradient is approximately unimodal (Gaussian or bell-shaped distribution), as shown in Figure 2-3. As a result, the linear PCA model usually does not fit most species data. Correspondence analysis, on the other hand, assumes a unimodal, rather than linear, relationship among the variables. [Hill 1974] was one of the first publications that introduced correspondence analysis, a technique originating in the 1930s, to ecologists. CA allows a simultaneous display of the sampling sites and of the species. Correspondence analysis is a powerful method for the multivariate exploration of large-scale databases [Hill 1974;Tian et al. 1993]. CA preserves the chi-square (χ2) distance between the rows and columns of a contingency table [Legendre and Legendre 1998]. The chi-square distance between two rows i and i’ of a contingency table N(I,J) with I rows (i = 1, 2, …, I) and J columns (j = 1, 2, …, J) having frequencies nij is computed below:

$$D_\chi(i, i') = \sum_{j=1}^{J} \frac{1}{n_{+j}} \left( \frac{n_{ij}}{n_{i+}} - \frac{n_{i'j}}{n_{i'+}} \right)^{2} \tag{2-3}$$

where $n_{i+} = \sum_{j} n_{ij}$ and $n_{+j} = \sum_{i} n_{ij}$ represent the marginal frequencies.
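Equation (2-3) can be illustrated with a few lines of Python. This is only a sketch and not part of the original analysis: the function name `chi_square_distance` and the toy table are hypothetical, and the sum is returned as written in the equation (some texts take its square root). Columns with a zero total are skipped, an assumption made here to avoid a division by zero.

```python
def chi_square_distance(table, i, k):
    """Chi-square distance between rows i and k of a contingency table, as in Eq. (2-3)."""
    row_tot = [sum(row) for row in table]        # n_{i+}
    col_tot = [sum(col) for col in zip(*table)]  # n_{+j}
    total = 0.0
    for j, n_plus_j in enumerate(col_tot):
        if n_plus_j == 0:
            continue  # assumption: an empty column contributes nothing
        diff = table[i][j] / row_tot[i] - table[k][j] / row_tot[k]
        total += diff * diff / n_plus_j
    return total


# Hypothetical sites-by-species counts (rows = sites, columns = species).
toy = [[10, 0, 5],
       [20, 0, 10],
       [0, 8, 2]]
d_same_profile = chi_square_distance(toy, 0, 1)
d_different = chi_square_distance(toy, 0, 2)
```

Rows 0 and 1 have proportional counts, so their profiles coincide and the distance is zero. The middle column, where both of these rows are zero, adds nothing to the sum.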

CA is an exploratory technique designed to find a multidimensional representation of the dependence between the rows and the columns of a two-way contingency table. The goal is to have a global view of the data as points in a low-dimensional space, such that the positions of the row and column points are consistent with their associations in the table. This representation is found by allocating scores to the row and column categories and displaying the categories as points, where the scores are used as coordinates of these points. These scores can be normalized in such a way that distances between row points and/or between column points in Euclidean space are equal to chi-square distances.

Figure 2-3: An example of a cumulative species response to an environmental gradient.

The computation of correspondence analysis proceeds in three steps:

1. The contingency table is transformed into a table of contributions to the chi-square statistic after fitting a null model to the table.
2. Singular value decomposition (SVD) is applied to that table and the eigenvalues and eigenvectors are computed.
3. Further matrix manipulations lead to the tables required for plotting in ordination space.

Because PCA is linear, data transformations are often necessary. Species abundance data usually have many zeros in the data matrix and no transformation can normalize them. The chi-square distance employed in CA excludes double zeros from the similarity estimation, which is not the case in PCA, which uses the Euclidean distance [Legendre and Gallagher 2001]. CA is therefore recommended for ordination when the data contain a large number of zeros.

CA has been applied in various fields of ecology. It is primarily used to find associations of the biota in the spatial domain. [Greenacre and Vrba 1984] apply CA to represent the wildlife areas and the antelope tribes in a joint display. A similar association between sampling sites and phytoplankton taxa is highlighted in [Takamura et al. 2003]. SAS, SPSS, BMDP, and NCSS are four popular statistical packages for CA, compared in [Thompson 1995].
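The first two computational steps of CA can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure used by any of the packages cited above: the null model is taken to be row-column independence, all row and column totals are assumed to be non-zero, and the full SVD is replaced by a simple power iteration that recovers only the leading eigenvalue. All function names are hypothetical.

```python
import math


def chi_square_contributions(table):
    """Step 1: q_ij = (O_ij - E_ij) / sqrt(E_ij * n), with E_ij from the independence model."""
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    q = []
    for i, row in enumerate(table):
        q_row = []
        for j, observed in enumerate(row):
            expected = row_tot[i] * col_tot[j] / n
            q_row.append((observed - expected) / math.sqrt(expected * n))
        q.append(q_row)
    return q


def total_inertia(q):
    """Sum of squared contributions, equal to the chi-square statistic divided by n."""
    return sum(x * x for row in q for x in row)


def leading_eigenvalue(q, iters=500):
    """Step 2 (partial): largest eigenvalue of Q^T Q, found by power iteration."""
    n_rows, n_cols = len(q), len(q[0])
    v = [1.0 + j for j in range(n_cols)]  # arbitrary starting vector
    lam = 0.0
    for _ in range(iters):
        u = [sum(q[i][j] * v[j] for j in range(n_cols)) for i in range(n_rows)]  # Q v
        w = [sum(q[i][j] * u[i] for i in range(n_rows)) for j in range(n_cols)]  # Q^T Q v
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            return 0.0  # table matches the null model exactly
        v = [x / norm for x in w]
        lam = norm
    return lam
```

For a 2 x 2 table there is only one non-trivial ordination axis, so the leading eigenvalue equals the total inertia; for larger tables the remaining axes would require a full decomposition.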

2.3.2. Canonical Correspondence Analysis

The two most commonly used constrained ordination techniques in ecology are Redundancy Analysis (RDA) [Legendre and Legendre 1998] and Canonical Correspondence Analysis (CCA) [Ter Braak 1986]. RDA is essentially a PCA in which the axes are restricted to be linear combinations of explanatory variables, but it is inappropriate under the unimodal model of species distribution. CCA is the constrained form of CA and is therefore preferred for most ecological data. CCA, which couples Correspondence Analysis with regression methodologies and provides for hypothesis testing [Ter Braak 1986], ushered in the biggest modern revolution in ordination methods. CCA is a direct gradient analysis technique combining CA with multiple regression, whereby species composition is related to measured environmental variables [Ter Braak 1986; Ter Braak 1987]. CCA, like CA, is a weighted averaging ordination technique. The main advantages of weighted averaging ordination include simultaneous ordering of sites and species, rapid and simple computation, and very good performance when species have nonlinear and unimodal relationships to environmental gradients [Palmer 1993].

The results of CCA can be expressed in a triplot, i.e. a plot of sample scores, species scores, and environmental variable arrows [Ter Braak 1994; Ter Braak and Verdonschot 1995]. Sites and species are represented by points, while the environment variables are represented by arrows. Arrow length indicates the importance of the environment variable in the model, arrow direction indicates how well the environment variable is correlated with the CCA axes, and the angle between two arrows indicates the correlation between the corresponding variables. A more detailed explanation of the interpretation of the CCA triplot is given later, with the results in Chapter 3.

As with CA, several algorithms are available to calculate CCA. [Ter Braak 1986] describes an algorithm based on reciprocal averaging that is employed by the popular FORTRAN-based program CANOCO [Canoco 2004]. [Legendre and Legendre 1998] discuss an efficient numerical procedure for the CCA computations. Relevant examples are given in [Blair 1996; Ter Braak 1987; Yabe and Nakamura 2002].
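The weighted averaging idea underlying CA and CCA can be illustrated with a short sketch. It is not an implementation of CCA or of the CANOCO algorithm; it only computes, for each species, the abundance-weighted mean of a single site-level variable, which corresponds to one half of a reciprocal averaging step. The function name and the toy data are hypothetical.

```python
def species_weighted_averages(abundance, site_values):
    """abundance[i][k]: abundance of species k at site i; site_values[i]: variable at site i.

    Returns, per species, the abundance-weighted mean of the site variable
    (None for a species that is absent from every site).
    """
    n_species = len(abundance[0])
    scores = []
    for k in range(n_species):
        total = sum(row[k] for row in abundance)
        weighted = sum(row[k] * x for row, x in zip(abundance, site_values))
        scores.append(weighted / total if total else None)
    return scores


# Hypothetical data: three sites, two species, one environmental gradient.
abund = [[10, 0],
         [0, 5],
         [10, 5]]
gradient = [1.0, 3.0, 2.0]
optima = species_weighted_averages(abund, gradient)
```

In this toy example the first species is centered at the low end of the gradient and the second at the high end, which is the kind of ordering a unimodal response model is meant to recover.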

2.4. Datasets Analyzed

We selected the datasets from Maryland and Ohio primarily because of their coverage of the biological, chemical and physical domains of ecological modeling, which was a prerequisite for the present research. These datasets, being of good size, helped to test and fine-tune the conceptual methodology.

2.4.1. Maryland Biological Stream Survey

The 1995-1997 Maryland Biological Stream Survey (MBSS) includes data from 955 first, second and third-order stream segments, encompassing all 17 major drainage basins in the state of Maryland over the three-year sampling period. Lab Water Chemistry and Benthic Macroinvertebrates were analyzed in spring (March-April), while Fish, Physical Habitat, and in-situ Water Chemistry were analyzed in summer (June-September). All sampling sites are classified into three geographic regions: West, Central, and East. Biological measurements include abundance and health of fish, composition of benthic macroinvertebrate communities, and presence of amphibians and reptiles, aquatic plants, and mussels. Chemical measurements include pH, sulfate, nitrate-nitrogen, conductivity, and dissolved oxygen, while physical habitat measurements took into account parameters such as flow, stream gradient, maximum depth, embeddedness, instream habitat, epifaunal substrate, pool and riffle quality, bank stability, channel flow status, shading, and riparian buffer type. The complete list of all the variables [Mercurio et al. 1999] used for analysis is provided for quick reference in Appendix A (Table A1). Statewide and basinwide results and an assessment of the condition of the streams have been reported in the MBSS three-year report [Roth et al. 1999].

2.4.2. Ohio EPA Dataset

Ohio pioneered the integration of biosurvey data, physical habitat data, and bioassays with water chemistry data to measure the overall integrity of water resources. Because the Ohio EPA recognized that traditional chemical monitoring cannot detect episodic pollution events or nonchemical impacts, biological monitoring provides the foundation of Ohio’s water programs. The Ohio dataset was assembled from the chemical, habitat and biological data collected by the Ohio EPA between 1995 and 2000 (July to September). The final dataset consisted of 1848 stations distributed over the entire state. Care was taken to ensure that most of the variables selected were adequately represented in the dataset, with a minimum of missing values. Also, the time window for synchronizing the sampling dates of the chemical and the biological samples at a particular station was set to one week before or after, to capture the temporal effects of the chemicals on the biota. Since the physical habitat was sampled once a year at each sampling station, the habitat data were duplicated to be accommodated in the dataset. The complete list of variables used in this technical report from the Ohio data is given in Appendix B (Table B1). In [OhioEPA 1999], various explanatory tools such as box-and-whisker plots, scatter plots and multivariate techniques like Principal Component Analysis were used to visualize regional patterns in nutrient concentration and relationships with biological performance parameters.


Chapter 3: Methodology and Results

In this chapter we first introduce a modeling methodology that combines self organizing maps (SOM), an unsupervised neural network, and ecological ordination techniques, such as Correspondence Analysis (CA) and Canonical Correspondence Analysis (CCA), in an unconventional manner to provide the watershed manager with summarized information at various levels of detail. The SOM provides a data clustering and visualization technique in a low dimensional space. The SOM clustering results are superimposed with ecological ordination techniques to explore the associations between clustered sites, fish species and the environmental variables. In the latter half of the chapter, we present all steps of this type of data analysis and visualization for the state of Maryland, using the publicly available MBSS dataset.

3.1. A Modeling Methodology for Large Scale Data Analysis and Visualization

The basis for this methodology is the patterning of the sampling sites in the dataset based on similar metric characteristics. We decided to use the fish metrics, and not the raw fish assemblages, as the input to the SOM, since the fish metrics are weighted averages of the raw data and hence are less susceptible to outliers. CA and CCA can then be carried out subsequently to acquire more detailed information about the specific fish species and the role of the environment variables in affecting watershed quality. The basic methodology covering the essential blocks is given in Figure 3-1.

The fish metrics data are normalized using a log transformation and then linearly scaled to [0, 1]. The normalized fish metric vectors (one per sampling site) are then used as input data to train a SOM in order to pattern the sampling sites. On the output map, sites representing similar conditions, as judged by their similar fish metric information, are mapped close together. We can visualize the SOM by drawing a U-matrix or a component plane representation. These provide information about the correlations between individual components, the division of data in the input space and the relative distributions of the components.

It is vital to find out whether the data suggest the existence of SOM neuron clusters. If this is the case, these clusters need to be extracted in order to fully exploit the properties of the data set by producing summary information. [Vesanto and Alhoniemi 2000] have shown, after experimenting with one real-world and two artificial data sets and using hierarchical agglomerative clustering and partitive clustering based on the k-means algorithm, that clustering SOM neurons, instead of directly clustering the data, is a computationally effective approach. We have used the k-means algorithm to cluster the SOM neurons optimally and to further extend our discussion on a per-cluster basis. Different visualization techniques based on the SOM were used to provide as much information as possible for management purposes. Various traits of the sampling sites, including the Ecoregion, the Stream Order, the Benthic IBI/Invertebrate Community Index (ICI) and the Physical Habitat Index (PHI)/Qualitative Habitat Evaluation Index (QHEI), were also summarized using the SOM clusters.
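The preprocessing of the fish metrics described above (a log transformation followed by linear scaling to [0, 1]) might look as follows for a single metric. The exact form of the log transformation is not stated in the text; log(1 + x) is assumed here so that zero values remain defined, and the function name is hypothetical.

```python
import math


def log_then_unit_scale(values):
    """Log-transform one metric and rescale it linearly to [0, 1] across sites."""
    logged = [math.log1p(v) for v in values]  # assumption: log(1 + x), requires x > -1
    lo, hi = min(logged), max(logged)
    if hi == lo:
        return [0.0 for _ in logged]  # constant metric: nothing to scale
    return [(x - lo) / (hi - lo) for x in logged]


# Hypothetical values of one metric at three sites.
scaled = log_then_unit_scale([0, 9, 99])
```

Scaling each metric separately keeps any one metric from dominating the Euclidean distances used during SOM training.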


Figure 3-1: Methodology showing how the SOM is combined with the Correspondence Analysis and Canonical Correspondence Analysis. The SOM is trained using the fish metrics vectors in order to pattern the sampling sites. The SOM neurons are then clustered using the k-means algorithm. For each SOM neuron the average (over all sampling sites patterned in that neuron) of the environmental variables and the fish species in the database are calculated. CA/CCA can then be used with the reduced datasets to provide summarized information.

Environmental variables related to water chemistry, physical habitat and land use, which were not used to train the SOM, were also mapped on the same SOM in order to shed light on the dependence of the fish metrics on these variables by comparing their maps. For each SOM neuron, the average of the environmental variables (over all sampling sites patterned in that neuron) is calculated. The fish assemblages are log normalized before they are patterned on the SOM, to account for their skewed distribution. After the mean (over all sites falling in a neuron) was calculated for each SOM neuron, the results were de-normalized and rounded to the nearest integer to obtain the reduced species data set. In addition, visualizing the fish IBI on the SOM in a similar manner may clearly separate regions with good IBI from those with poor IBI and confirm the existence of groups of sampling sites with similar IBI patterns, possibly located in different parts of the state.

To summarize the gradient distribution of the fish species with respect to the SOM cells, Correspondence Analysis (CA) was carried out on the reduced species data. Canonical Correspondence Analysis was subsequently used to relate the environment variables to the fish species. Further, the clustering of the SOM is superimposed on the CA/CCA and the results are analyzed on a per-cluster basis. Combining CA and CCA with SOM based clustering provides watershed analysis on all three hierarchy levels of fish information: the fish IBI, the fish metrics and the raw species data. This approach gives an overview of the prevailing conditions based on the fish metrics through the SOM, while allowing us to fine-tune the existing information based on the fish species through the CA/CCA. Compared to the raw data traditionally used for CA/CCA, the problem size for the input matrices is considerably reduced and is now comparable to the number of SOM neurons. This problem size reduction is numerically advantageous, especially as the CA/CCA involves decomposition of the matrices via the singular value decomposition (SVD) [Legendre and Legendre 1998].

In addition to software developed in Matlab (Version 6.5), we have also utilized a public domain SOM Matlab toolbox [Vesanto 1999] developed at the Laboratory of Information and Computer Science, Helsinki University of Technology. The SOM toolbox provides Matlab functions that can be used to preprocess the data, initialize and train SOMs, and visualize the SOMs in various ways.
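The construction of the reduced species data set (log normalization, averaging per SOM neuron, de-normalization and rounding) can be sketched as follows. This is an illustration only, written in Python rather than Matlab: the transform log(1 + x) and its inverse are assumptions, and the names `reduce_species_by_neuron` and `bmu` (the index of the neuron to which each site is assigned) are hypothetical.

```python
import math
from collections import defaultdict


def reduce_species_by_neuron(counts, bmu):
    """counts[s]: species counts at site s; bmu[s]: index of the SOM neuron holding site s.

    Returns {neuron index: list of rounded, de-normalized mean counts per species}.
    """
    groups = defaultdict(list)
    for site_counts, unit in zip(counts, bmu):
        groups[unit].append([math.log1p(c) for c in site_counts])  # assumed log(1 + x)
    reduced = {}
    for unit, rows in groups.items():
        means = [sum(col) / len(rows) for col in zip(*rows)]   # mean in log space
        reduced[unit] = [round(math.expm1(m)) for m in means]  # back-transform, then round
    return reduced


# Hypothetical example: three sites, two species, two occupied neurons.
reduced = reduce_species_by_neuron([[0, 9], [0, 99], [3, 0]], [0, 0, 1])
```

Averaging in log space damps the influence of a single very large count, which is one reason to de-normalize only after the per-neuron mean has been taken. Note that Python's `round` resolves exact halves to the nearest even integer.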

3.2. Analysis of Maryland Data

Maryland is a state of great geographical diversity, extending from the Appalachian Mountains in the west to the beaches of the Atlantic Ocean in the east. This diversity is manifested in Maryland’s vast and intricate network of freshwater streams. The state also borders the world’s most productive estuary, the Chesapeake Bay. The importance of the Chesapeake Bay in determining the state of the waters in Maryland can be gauged from the fact that 16 of the 18 state river basins empty into the Bay. Attempts to improve the water quality in the state will ultimately decide the health and the future of the Chesapeake Bay.

Geographically, Maryland can be divided into three broad areas: the Appalachian Plateau, the Piedmont and the Coastal Plains. The Appalachian Plateau has expansive limestone-rich valleys, broadly sloping mountains and deep valleys cut into the plateau by rivers. The Piedmont Plateau is composed of hard, crystalline igneous and metamorphic rocks and extends from the inner edge of the Coastal Plain westward to Catoctin Mountain, the eastern boundary of the Blue Ridge Province. The Coastal Plains, split by the Chesapeake Bay, are characterized by slow-flowing streams and are separated from the rest of Maryland by the Fall Line. The Fall Line, roughly along Interstate 95, represents the geographical area where the rivers descend from the hilly Piedmont to the flat and sandy Coastal Plain.

The fish IBI was not calculated for sites with a catchment area of less than 300 acres. For the brook trout and blackwater streams, fish IBIs of less than 3.0 were not reported. However, these sites are included in the analysis, which makes it possible to study the environmental characteristics at these unreported sites. Since the MBSS dataset did not include the individual metrics, ten (10) fish metrics (Table 3-1) are first calculated for each sampling site from the raw fish species data.
Before the metrics related to native, benthic and the intolerant fish species are ranked, an adjusted value is calculated from the actual value, taking into account the watershed area:

$$\text{Adjusted value} = \frac{\text{Observed value}}{m \cdot \log(\text{watershed area in acres}) + b} \tag{3-1}$$

where m and b represent the slope and the intercept for the metric, based on linear regression between the metric and log(watershed area in acres) [Roth et al. 2000]. A detailed enumeration of the metrics (Table 3-1) and the associated rankings is given in Appendix A (Table A2).

Variable      Label
NUMNATIVE     Number of Native species
NUMBENTHIC    Number of Benthic fish Species
NUMINTOL      Number of Intolerant Species
PCTOL         Percentage Tolerant fish
PCDOM         Percentage of dominant species
PCGOI         Percentage of generalists, omnivores, and invertivores
PCINSECT      Percent insectivores
NUMINDVSQM    Number of individuals per square meter
BIOPSQM       Biomass (g) per square meter
PCSPAWN       Percent lithophilic spawners

Table 3-1: Metrics used to compute the Fish IBI in the MBSS [Roth et al. 2000].

Since sites with small watershed areas are included, the adjusted values also contain negative values. In the next section we present results and a detailed discussion on the various steps involved in the methodology.
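Equation (3-1) can be written as a one-line function. The sketch below is illustrative only: a base-10 logarithm is assumed because the text does not state the base, and the slope and intercept used in the example are hypothetical values, not the regression coefficients of [Roth et al. 2000].

```python
import math


def adjusted_metric(observed, area_acres, m, b):
    """Adjusted value of Eq. (3-1): observed value divided by the value expected for the area."""
    expected = m * math.log10(area_acres) + b  # assumption: base-10 logarithm
    return observed / expected


# Hypothetical coefficients, chosen only to show the sign change for small watersheds.
m_demo, b_demo = 2.0, -5.0
large = adjusted_metric(6.0, 1000.0, m_demo, b_demo)  # expected value is positive
small = adjusted_metric(6.0, 100.0, m_demo, b_demo)   # expected value is negative
```

With a negative intercept, the expected value can drop below zero for sufficiently small watersheds, which is one way negative adjusted values can arise.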

3.2.1. Determining the Size of the SOM

The size of the SOM (number of output neuron units) has a strong influence on the quality of the clustering. If the selected map size is too small, the map might fail to capture some important differences present in the data. Conversely, if the selected map size is too large, the differences between units may become too small to detect. Typically two quality criteria are used: resolution and topology preservation, assessed via the quantization and topographic errors [Kohonen 2001]. The quantization error is defined as the mean Euclidean distance between each data vector and the weight vector of its best matching unit (BMU), and measures map resolution. The topographic error [Kiviluoto 1996] is calculated as the proportion of all data vectors for which the first and second BMUs are not adjacent units in the grid. We decided to have a small yet optimal map size for the SOM, keeping in mind the size of the dataset and the resultant number of clusters. The optimum map size was decided after considering both the topographic and the quantization errors. Since a very large map is undesirable, we decided to use 12 × 5 = 60 neurons, a size that minimizes the topographic error (0.0136) while also resulting in a very small quantization error (0.1939) (Figure 3-2). The SOM training was done with 20 epochs in the rough training phase and 100 epochs in the fine-tuning phase.
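The two quality measures defined above can be sketched in Python as follows. This is not the SOM toolbox code: a rectangular lattice with four-neighbor adjacency is assumed for simplicity (the lattice used in the report is not stated in this section), and the names `som_errors`, `codebook` and `grid` are hypothetical.

```python
import math


def two_best_units(x, codebook):
    """Indices of the best and second-best matching units for data vector x."""
    order = sorted(range(len(codebook)), key=lambda u: math.dist(x, codebook[u]))
    return order[0], order[1]


def som_errors(data, codebook, grid):
    """Return (quantization error, topographic error).

    codebook[u] is the weight vector of unit u and grid[u] = (row, col) its map position.
    """
    q_sum = 0.0
    t_count = 0
    for x in data:
        first, second = two_best_units(x, codebook)
        q_sum += math.dist(x, codebook[first])
        (r1, c1), (r2, c2) = grid[first], grid[second]
        if abs(r1 - r2) + abs(c1 - c2) != 1:  # assumption: 4-neighbor rectangular lattice
            t_count += 1
    return q_sum / len(data), t_count / len(data)


# Hypothetical one-dimensional map with three units.
weights = [[0.0], [1.0], [2.0]]
ordered = [(0, 0), (0, 1), (0, 2)]
folded = [(0, 0), (0, 2), (0, 1)]  # same weights, but the map is folded
samples = [[0.1], [1.9]]
qe, te = som_errors(samples, weights, ordered)
qe_folded, te_folded = som_errors(samples, weights, folded)
```

The folded layout leaves the quantization error unchanged but raises the topographic error, which is why both measures are considered when choosing the map size.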


Figure 3-2: Quantization and topographic errors as a function of the SOM map units. We selected 60 = 12 X 5 as the optimal number of map units.

3.2.2. The U Matrix Representation of the SOM

Figure 3-3: Representation of the SOM U matrix. The inter-unit values are the Euclidean distances between adjacent map units. The levels of gray shown inside a specific unit are found by taking the median of the surrounding gray level values. The U matrix visually suggests the presence of three groups of neurons (clusters).

An initial impression of the number of neuron clusters present on the SOM and their spatial arrangement is usually acquired by visual inspection of the map. The U-matrix [Ultsch and Siemon 1990] is a representation of the SOM that helps to visualize inter-neuron distances while also revealing the cluster structure of the map. The matrix contains the distances from each unit center to all of its neighbors in the low dimensional grid. For effective visualization and easier interpretation, a gray scale can be chosen (Figure 3-3). A dark (light) coloring between the neurons corresponds to a large (small) Euclidean distance and thus a large (small) gap between the codebook values in the input space. Therefore, light areas can be thought of as neuron clusters and dark areas as cluster separators. For the MBSS fish data, the calculated U matrix gives the visual impression that there exist three neuron clusters: one covering the bottom-right region, another covering the top two rows and a third one concentrated in the middle region.
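The per-unit values of a U-matrix can be sketched as follows. This is a simplified illustration and not the SOM toolbox routine: a rectangular lattice with four neighbors per interior unit is assumed, only the value shown inside each unit (the median of the distances to its neighbors) is computed, and the function name is hypothetical.

```python
import math
from statistics import median


def u_matrix_unit_values(codebook, rows, cols):
    """codebook[r * cols + c] is the weight vector of the unit at grid position (r, c).

    Returns a rows x cols list holding, for each unit, the median Euclidean
    distance to its grid neighbors. The map must contain at least two units.
    """
    values = []
    for r in range(rows):
        line = []
        for c in range(cols):
            w = codebook[r * cols + c]
            dists = [math.dist(w, codebook[rr * cols + cc])
                     for rr, cc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1))
                     if 0 <= rr < rows and 0 <= cc < cols]
            line.append(median(dists))
        values.append(line)
    return values


# Hypothetical 1 x 4 map: two tight pairs of units separated by a large gap.
umat = u_matrix_unit_values([[0.0], [0.1], [5.0], [5.1]], 1, 4)
```

Units facing the gap receive larger values (darker shading in the gray scale described above), while units inside a tight group receive small values.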

3.2.3. Clustering of the SOM Neurons To determine the SOM neuron unit clusters, techniques such as k-means or hierarchical clustering [Vesanto and Alhoniemi 2000] can be applied to the trained SOM.

Figure 3-4: The Davies-Bouldin index used in conjunction with the k-means algorithm to find the optimal number of clusters present in the SOM. The index achieves a minimum (optimal) value for 3 clusters.

The Davies-Bouldin index [Davies and Bouldin 1979], which is a function of the ratio of the sum of within-cluster scatter to the between-cluster separation, can be used for selecting the optimal number of clusters. The Davies-Bouldin index for c clusters can be calculated as

$$DB(c) = \frac{1}{c} \sum_{k=1}^{c} \max_{l \neq k} \left( \frac{S_c(Q_k) + S_c(Q_l)}{d(Q_k, Q_l)} \right) \tag{3-2}$$

$S_c(Q_k)$ represents the intra-cluster distance (dispersion) of cluster $Q_k$, defined as the sum of the distances between all the cluster members $x_i$ and the centroid $c_k$, divided by the number of cluster members $N_k$:

$$S_c(Q_k) = \frac{1}{N_k} \sum_{i} \lVert x_i - c_k \rVert \tag{3-3}$$

The inter-cluster distance $d(Q_k, Q_l)$ between two clusters $Q_k$ and $Q_l$, defined as the distance between the two cluster centroids, is given by:

$$d(Q_k, Q_l) = \lVert c_k - c_l \rVert \tag{3-4}$$

The Davies-Bouldin index is suitable for the evaluation of k-means partitioning because it gives low values, indicating good clustering results, for spherical clusters (Figure 3-4). The k-means clustering was performed on the SOM codebook vectors with 100 iterations. Figure 3-4 suggests the existence of 3 clusters. Figure 3-5 shows the spatial distribution of the sampling sites falling within each cluster after applying the k-means algorithm.
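Equations (3-2) to (3-4) translate directly into code. The sketch below is illustrative: it evaluates the index for a given assignment of points to clusters and does not perform the k-means step itself. It assumes at least two clusters with distinct centroids, and the function name is hypothetical.

```python
import math


def davies_bouldin(points, labels):
    """Davies-Bouldin index of Eq. (3-2) for points with the given cluster labels."""
    clusters = sorted(set(labels))
    members = {k: [p for p, lab in zip(points, labels) if lab == k] for k in clusters}
    # Centroid c_k: coordinate-wise mean of the cluster members.
    centroid = {k: [sum(col) / len(m) for col in zip(*m)] for k, m in members.items()}
    # Scatter S_c(Q_k) of Eq. (3-3): mean distance of the members to their centroid.
    scatter = {k: sum(math.dist(p, centroid[k]) for p in members[k]) / len(members[k])
               for k in clusters}
    total = 0.0
    for k in clusters:
        # Worst-case ratio over all other clusters, with d(Q_k, Q_l) from Eq. (3-4).
        total += max((scatter[k] + scatter[l]) / math.dist(centroid[k], centroid[l])
                     for l in clusters if l != k)
    return total / len(clusters)


# Hypothetical one-dimensional data with an obvious two-cluster structure.
pts = [[0.0], [1.0], [10.0], [11.0]]
db_good = davies_bouldin(pts, [0, 0, 1, 1])  # compact, well separated clusters
db_bad = davies_bouldin(pts, [0, 1, 0, 1])   # interleaved clusters
```

In a model selection loop, this function would be evaluated on the k-means result for each candidate number of clusters, and the number giving the lowest value would be retained.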

Figure 3-5: k-means clustering of the SOM neurons and spatial distribution of sites in each cluster for Maryland. The bottom left panel shows the result of the k-means clustering of the SOM neurons after 100 iterations, and largely agrees with the U matrix (Figure 3-3). Cluster 1 sites are concentrated in the western part of the state. The coastal area sites are mostly in Cluster 3.

The sampling sites in Cluster 1 fall roughly into one of the following two groups: Group 1 in the Youghiogheny basin in the west, and Group 2 in the Middle Potomac Basin in central Maryland. Cluster 2 is predominant in the Piedmont Province. The Coastal Plains are mostly populated with sites that belong to Cluster 3. One major difference between the Coastal Plains and the other physiographic provinces in Maryland is the response of streams to organic enrichment [Roth et al. 1999]. Because of the lower gradient and naturally limited capacity to mechanically aerate the water and replace oxygen lost via biochemical oxygen demand (BOD), streams in the Coastal Plain tend to become overenriched more often than streams elsewhere in the State.

To draw some conclusions about the clusters, we applied the cluster information to the ecoregion (Figure 3-6) and the stream order (Figure 3-7) in Maryland. Ecoregions [Omernik 1987] are considered useful units for management because they are relatively homogeneous with respect to their structural components and dominant ecological processes. Maryland consists of six Level III ecoregions: Middle Atlantic Coastal Plain, Northern Piedmont, Southeastern Plains, Blue Ridge, Ridge and Valley and Central Appalachians (Figure 3-6).

Ecoregion                           Cluster 1   Cluster 2   Cluster 3
63-Middle Atlantic Coastal Plain      1.41%      16.90%      81.69%
64-Northern Piedmont                 49.01%      39.01%      11.89%
65-Southeastern Plains                2.02%      45.96%      52.02%
66-Blue Ridge                        69.23%       7.69%      23.07%
67-Ridge and Valley                  52.94%      15.44%      31.62%
69-Central Appalachians              51.90%      26.58%      21.52%

Table 3-2: Distribution of ecoregions in each cluster in Maryland. The percentages indicate the proportion of the sites in each ecoregion for each cluster.

Figure 3-6: Distribution of the 6 Level III Ecoregions in Maryland. The vertical axis represents the total sites in Maryland for each category. The barplots have been arranged from left to right, designated as Overall (all the 955 sampling sites in Maryland), Cluster 1, Cluster 2 and Cluster 3. The numbers in front of each legend indicates the corresponding Ecoregion number. The numerical results are tabulated in Table 3-2.


The Middle Atlantic Coastal Plain is largely contained in Cluster 3 (81%). The Northern Piedmont is split mainly between Cluster 1 (49%) and Cluster 2 (39%), with Cluster 1 occupying the western part and Cluster 2 covering the eastern region (Figure 3-5). A similar result is obtained for the Southeastern Plains, which are split between Cluster 2 (45%) and Cluster 3 (52%). We also note the scarcity in Cluster 2 of sites from the ecoregions of West Maryland: Blue Ridge (7%), Ridge and Valley (15%) and Central Appalachians (26%).

Figure 3-7: Distribution of the Stream Order in Maryland. The vertical axis represents the total sites in Maryland for each category. The barplots have been arranged from left to right, designated as Overall (all the 955 sampling sites in Maryland), Cluster 1, Cluster 2 and Cluster 3. The results are tabulated in Table 3-3.

Strahler's stream order system is a simple method of classifying stream segments based on the number of tributaries upstream. Sampling across stream order was uniformly distributed throughout the entire state. Clusters 1 and 2 behave similarly with respect to the distribution of the stream order. Cluster 3 contains more than 50% (57%) of the first order streams and less than 20% (18%) of the third order streams.

Stream Order   Cluster 1   Cluster 2   Cluster 3
1               19.08%      23.70%      57.23%
2               40.24%      33.33%      26.43%
3               41.08%      40.74%      18.19%

Table 3-3: Distribution of the stream order in each cluster in Maryland. The percentages indicate the proportion of the sites in each cluster for a particular stream order.


3.2.4. Relationships between Biological and Environmental Variables

This is the most important part of the analysis from a watershed manager’s point of view. The following visualizations aim to provide, through the SOM, both a snapshot of quantitative analysis results and a summary of the effects that several environmental variable gradients simultaneously have on biological integrity indicators, such as the fish metrics and the fish IBI.

3.2.4.1. Visualization of Fish Metric Gradients on the SOM

Each input variable has a different weighted connection to each of the SOM neurons (Figure 2-2). Since the value of this weight models the influence of an input variable on the activity of each SOM neuron, the distribution of each fish metric variable on the SOM (Figure 3-8) can be inferred from the corresponding vector of weights. The component plane representation can be thought of as a sliced version of the Self-Organizing Map corresponding to a particular component.

Figure 3-8: Fish metric component planes visualized on the SOM (left) and the corresponding boxplots (right) for the individual clusters. There is a clear distribution gradient in most of the metric components. The ranges for the metrics are shown in the corresponding colorbar. The metrics abbreviations correspond to those listed in Table 3-1.


Boxplots have been drawn for each of the fish metrics to visually estimate their distribution in the clusters, as well as the presence and position of outliers. The box extends from the lower quartile to the upper quartile values, with a line in the middle for the median. The whiskers are lines extending from each end of the box to a maximum of 1.5 times the interquartile range to show the extent of the rest of the data. Values beyond the ends of the whiskers are considered outliers (shown as crosses). We observe that the metrics related to the percentage of Insectivores (PCINSECT) and the percentage of the group of Generalists, Omnivores and Invertivores (PCGOI) exhibit gradients in opposite directions. The metrics for the number of Native species (NUMNATIVE) and intolerant species (NUMINTOL) are mirror images of each other. Going from Cluster 1 to Cluster 3 in the boxplots, we see a gradual increase in the values for the metrics linked with tolerant species (PCTOL), while there is a gradual decline in the values for the metrics related to the Benthic species (NUMBENTHIC) and fish density (NUMINDVSQM). The metrics related to Native species and Benthic species, which are associated with species richness, are expected to decrease in value in response to anthropogenic stress. Because many benthic fishes have relatively limited home ranges, they are potentially valuable indicators of local conditions. The density of individual fish count (NUMINDVSQM) and the biomass (BIOPSQM) are an indication of the overall fish abundance and these metrics decrease with increase in stress. The percentage of individuals belonging to the dominant taxa (PCDOM) in the fish community is likely to increase as the amount and extent of degradation increases. The relative abundance of tolerant habitat generalists (PCTOL) also follows a similar trend. 
Based on the above observations, it can be inferred just from Figure 3-8 that the map units in the top right portion of the SOM include sampling sites with relatively high levels of degradation. The effect of anthropogenic stress is less severe in the sampling sites belonging to neurons in the lower portion of the SOM.

3.2.4.2. Visualization of the Fish IBI

Once the metrics were visualized on the SOM, the next step was to understand the net effect of the metrics through the fish IBI on the SOM. The range of the fish IBI, taken as the average of the ranked fish metrics, extends from 1.0 (for very poor, highly degraded streams) to 5.0 (for good, pristine streams). Since the fish IBI values are available for all sites in the database, the mean value of the fish IBI ($IBI_m$) over all sampling sites patterned in the same SOM neuron was calculated as follows:

$$IBI_m = \frac{1}{n} \sum_{i=1}^{n} ibi_i \tag{3-5}$$

where n is the number of input vectors (sampling sites) assigned to a given output neuron of the trained SOM, and ibi_i is the fish IBI value of sampling site i. Based on the fish IBI distribution and the clustering results (Figure 3-9), we see that Cluster 1 neurons, concentrated in West Maryland, have high average IBI values (3.7662 ± 0.4948). Cluster 2 (3.1405 ± 0.7551) and Cluster 3 (2.5419 ± 1.0814) have neurons whose sites show average to low mean fish IBI values. Cluster 3 in particular has a higher percentage of first order streams (Figure 3-7) compared to the other clusters. This indicates degraded conditions in first-order streams, or may reflect the tendency of the IBI to underrate small streams, even though watershed size is accounted for in the calculation of the fish IBI [Roth et al. 1999]. The following two figures (Figure 3-10 and Figure 3-11) demonstrate the efficiency of the SOM in analyzing the distribution of the fish IBI through the patterning.

Figure 3-9: Distribution of the fish IBI on the SOM (left) and the corresponding boxplots (right) for the individual clusters. Overall indicates that the data represents all the SOM neurons which cover data from the entire state. The low values of the Fish IBI are concentrated in the top right area of the SOM (mostly in sites belonging to neurons of cluster 3).


Figure 3-10: Distribution of the fish IBI in Maryland. The averaged fish IBI values on the SOM are duplicated for the sampling sites falling in the same SOM neuron and reproduced on the map.

The average fish IBI computed in Equation (3-5) for each SOM neuron is replicated for the sampling sites and reproduced on the state map (Figure 3-10). The Patapsco, Potomac Washington Metro and North Branch Potomac basins and the eastern section of the Upper Potomac basin have a large number of sites rated as poor. Figure 3-11 summarizes the distribution of the IBI through the SOM. The Mean Square Error between the averaged IBI value and the actual fish IBI over all 955 sampling sites was found to be 0.3412 over the entire range of IBI values.
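The averaging in Equation (3-5), the replication of each neuron mean back to its sites, and the resulting Mean Square Error can be sketched as follows. This is only an illustration under stated assumptions: the function name `neuron_mean_and_mse`, the integer neuron indices in `bmu`, and the six-site toy data are hypothetical and are not taken from the MBSS dataset.

```python
import numpy as np

def neuron_mean_and_mse(ibi, bmu):
    """Per-neuron mean IBI (Eq. 3-5), the means replicated per site, and the MSE.

    ibi : observed fish IBI of each sampling site
    bmu : index of the SOM neuron each site is assigned to (hypothetical input)
    """
    ibi = np.asarray(ibi, dtype=float)
    bmu = np.asarray(bmu, dtype=int)
    n_neurons = bmu.max() + 1
    counts = np.bincount(bmu, minlength=n_neurons)
    sums = np.bincount(bmu, weights=ibi, minlength=n_neurons)
    # Neurons that received no site are left as NaN rather than divided by zero.
    means = np.full(n_neurons, np.nan)
    occupied = counts > 0
    means[occupied] = sums[occupied] / counts[occupied]
    # Replicate each neuron mean for every site that falls in that neuron.
    replicated = means[bmu]
    mse = float(np.mean((ibi - replicated) ** 2))
    return means, replicated, mse

# Toy data (hypothetical): six sites mapped to three neurons.
ibi = [4.0, 3.0, 2.0, 2.5, 1.0, 5.0]
bmu = [0, 0, 1, 1, 2, 2]
means, replicated, mse = neuron_mean_and_mse(ibi, bmu)
```

A small MSE means that the sites grouped in a neuron have similar IBI values, so the neuron average is a fair summary of its sites.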


Figure 3-11: Comparing the actual IBI values with the SOM averaged fish IBI in Maryland. The Mean Square Error (MSE) was calculated. Each frame represents a SOM cell neuron with the label indicated at the bottom right of the frame (in steps of 5). While the scatter points in each frame represent the observed fish IBI of the sites falling in that SOM neuron, the horizontal line in the frame represents the averaged IBI value of those sites.

3.2.4.3. Visualization of Benthic IBI and Physical Habitat Index (PHI)

Less noticeable to the general public than fish, benthic (or bottom-dwelling) creatures are essential to the functioning of aquatic ecosystems, including providing much of the food for other species. They are particularly sensitive to changes in water quality and physical habitat. The Benthic IBI has a range similar to the fish IBI, from 1.0 (for poor streams) to 5.0 (for streams in excellent condition). The results of the patterning of the benthic IBI on the SOM and the resultant boxplots are shown in Figure 3-12. Most of the benthic IBI values occur in the mid range.


Figure 3-12: Distribution of the Benthic IBI on the SOM (left) and the corresponding boxplots (right) for the individual clusters. The different sizes of the neurons indicate the associated cluster for the neuron, whereby the largest (smallest) size indicates Cluster 1 (Cluster 3). Overall indicates that the data represents all the SOM neurons which cover data from the entire state. The results are similar to those for the fish IBI (Figure 3-9).

The correlation between the fish IBI and the benthic IBI was only moderate (ρ = 0.5569). One reason could be the different responses of fish and benthic organisms to stressors. Fish are more mobile and are better able to avoid a stressor in the water. Benthos may be more directly affected by habitat degradation that leads to sedimentation and movement of unstable substrates [Roth et al. 1999]. In addition, the fish IBI was compared with the Physical Habitat Index (PHI). The PHI incorporates embeddedness and most other habitat parameters, but as a summation of effects. Therefore the overall PHI may not have an impact as profound as some important individual metrics such as embeddedness. The PHI is calibrated on a scale of 0 to 100, with 0 indicating very poor habitat and 100 indicating good habitat. A significant positive correlation (ρ = 0.7059) was found between the PHI and the fish IBI over all 60 SOM neurons (Figure 3-13). Overall, half of the sampling sites (50%) were rated as fair (with PHI scores between 42 and 71.9).
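A correlation of this kind, computed over neuron-level averages, can be illustrated as below. The text does not state which correlation coefficient was used, so the Pearson coefficient here is an assumption, and the function name `pearson_r` and the five pairs of values are hypothetical.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation between two equally long sequences of neuron-level values."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()  # deviations from the mean
    yc = y - y.mean()
    return float(xc @ yc / np.sqrt((xc @ xc) * (yc @ yc)))

# Hypothetical neuron-level averages of the fish IBI and the PHI.
fish_ibi = [4.1, 3.6, 3.0, 2.4, 1.9]
phi = [82.0, 70.0, 65.0, 50.0, 38.0]
r = pearson_r(fish_ibi, phi)
```

Working with one averaged value per neuron, rather than one value per site, smooths site-level noise, which is one reason neuron-level correlations can be higher than site-level ones.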


Figure 3-13: Distribution of the PHI on the SOM (left) and the corresponding boxplots (right) for the individual clusters. The different sizes of the neurons indicate the associated cluster for the neuron, whereby the largest (smallest) size indicates Cluster 1 (Cluster 3). Overall indicates that the data represents all the SOM neurons which cover data from the entire state.

3.2.4.4. Visualization of Environmental Variable Gradients on the SOM

To understand relationships between biological and environmental variables, environmental variable vectors were introduced into the trained SOM in a manner similar to the fish IBI. The mean value of each environmental variable (E_m), over all sampling sites patterned in the same SOM neuron, was calculated as follows:

$$E_m = \frac{1}{n} \sum_{i=1}^{n} e_i \tag{3-6}$$

where n is the number of input vectors (sampling sites) assigned to each output neuron of the trained SOM, and e_i is the value of environmental variable e for input vector i. We separated the figures by variable domain: Water Chemistry (Figure 3-14), Physical Habitat (Figure 3-15) and Land Use (Figure 3-16). We see a high correlation between dissolved oxygen (DO_FLD) and pH (PH_FLD, PH_LAB), although this correlation cannot be explained. From the boxplots in Figure 3-14, we see that nitrate (NO3_LAB) is at roughly the same level across all 3 clusters, although the median (> 2 mg/l) is elevated, an indication of anthropogenic influence such as agricultural runoff, wastewater discharge and urban nonpoint sources. In a natural unimpacted stream NO3 would be 0.95) with each other and were eliminated. In particular, Low Urban Use (LOWURB) was highly correlated with Urban Use (URBAN), Emergent Wetlands (EMERGWET) with Wetlands (WETLANDS) and Transitional Land Use (TRANS) with Barren Land Use (BARREN), and these variables were not considered later in the analysis. CCA was applied to the dataset with reduced species (61) and environmental variables (47) pertaining to the SOM neuron sites (60).
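The elimination of highly correlated variables can be sketched as follows. The surviving text is incomplete at this point, so this sketch assumes that 0.95 is a threshold on the absolute pairwise correlation and that, within a correlated pair, the variable listed later is the one dropped. The function name `drop_correlated` and the numeric values are hypothetical.

```python
import numpy as np

def drop_correlated(X, names, threshold=0.95):
    """Keep a variable only if it is not too correlated with any variable kept so far.

    X         : array of shape (n_samples, n_variables)
    names     : variable names, in column order
    threshold : assumed cutoff on the absolute Pearson correlation
    """
    corr = np.corrcoef(np.asarray(X, dtype=float), rowvar=False)
    kept = []
    for j in range(len(names)):
        if all(abs(corr[j, k]) <= threshold for k in kept):
            kept.append(j)
    return [names[j] for j in kept]

# Hypothetical values: LOWURB closely tracks URBAN, FOREST does not.
urban = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
lowurb = urban * 0.8 + np.array([0.0, 0.1, -0.1, 0.05, -0.05, 0.0])
forest = np.array([9.0, 2.0, 7.0, 1.0, 8.0, 3.0])
X = np.column_stack([urban, lowurb, forest])
kept = drop_correlated(X, ["URBAN", "LOWURB", "FOREST"])
```

Removing near-duplicate variables in this way keeps redundant arrows out of the ordination; which member of a pair to retain is a modelling choice.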

Water Chemistry

Cluster 1

Cluster 2

Cluster 3

DO_FLD

ANC_LAB

DOC_LAB

PH_FLD

COND_FLD

NO3_LAB(1)

PH_LAB

COND_LAB

SO4_LAB TEMP_FLD

Physical Habitat

AESTHET

CH_FLOW(1,2)

AVGTHAL(2)

AVG_VEL

POOLQUAL(1,2)

AVGWID(2)

BANKSTAB

EMBEDDED

CHAN_ALT

MAXDEPTH(2)

EPI_SUB

NUMROOT(2,1)

FLOW

REMOTE

INSTRHAB

RIP_WID

RIFFQUAL

WOOD_DEB

SHADING(3) ST_GRAD VEL_DPTH(2)

Land Use

AGRI

COALMINE

ACREAGE(2)

DECIDFOR

HIGHURB(3)

BARREN

FOREST

PASTUR(1,2)

CONIFER

ROWCROP

URBAN(3)

MIXEDFOR

WATER(2,1)

PROBCROP(2)

WETLANDS

WOODYWET

Table 3-8: Association of environmental variables with k-means based neuron clusters. Each variable is listed in the column corresponding to the cluster with the largest per-cluster median projection on that variable’s line. Bold text highlights variables for which both methods (SOM and the SOM combined with CCA) confirm a strong association with a cluster. For the other variables, the numbers in parentheses indicate the SOM-assigned cluster association (Table 3-5). Furthermore, if a variable achieves its maximum per-cluster median value in a border neuron, the neighboring cluster number is also noted.

The results of CCA can be expressed in a triplot, i.e. a plot of sample scores, species scores, and environmental variable arrows [Ter Braak 1994; Ter Braak and Verdonschot 1995]. Sites and species are represented by points. Arrows for the environmental variables point in the direction of maximal variation in the value of the corresponding variable. Environmental variables deemed important are represented by longer arrows than less important ones. The projection of a site score onto an environmental variable arrow indicates whether the site is associated with higher than average values of that variable (the projection falls on the same side of the origin as the arrow) or lower than average values (the origin lies between the projection and the arrow). Lines may be extended in both directions from the origin of the plot to obtain the projections of the site and species scores on the environmental variables.


The CCA plot is presented in Figure 3-20 with the corresponding SOM cluster information superimposed. From Figure 3-20, the orthogonal projection of every SOM neuron score onto each environmental variable arrow was calculated. The projections are grouped by the cluster label assigned to each SOM neuron by the k-means clustering, and the median for each cluster (the per-cluster median projection) is calculated to summarize the statistical distribution for the cluster. Since the magnitude of the projection models the deviation of the environmental variable from its overall mean value, the per-cluster median projection provides a statistical measure of how a variable affects the sites of a cluster. In Table 3-8 we placed each environmental variable in the column corresponding to the cluster for which the per-cluster median projection on that variable’s arrow is the largest. (See Appendix A: Table A5 for the actual calculations.)
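The per-cluster median projection described above can be sketched as follows. This is a minimal illustration, not the report's own code: the function name `assign_variables_to_clusters`, the two-dimensional scores, and the arrow coordinates are hypothetical, and the variable names are borrowed from the text only as labels.

```python
import numpy as np

def assign_variables_to_clusters(scores, labels, arrows, names):
    """Assign each variable to the cluster with the largest median projection.

    scores : (n_neurons, 2) neuron scores on the first two canonical axes
    labels : (n_neurons,) k-means cluster label of each neuron
    arrows : (n_variables, 2) environmental variable arrows
    names  : variable names, in arrow order
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    arrows = np.asarray(arrows, dtype=float)
    # Signed length of the orthogonal projection of every score on every arrow.
    unit = arrows / np.linalg.norm(arrows, axis=1, keepdims=True)
    proj = scores @ unit.T  # shape (n_neurons, n_variables)
    clusters = np.unique(labels)
    # Per-cluster median projection, one row per cluster.
    medians = np.array([np.median(proj[labels == c], axis=0) for c in clusters])
    best = clusters[np.argmax(medians, axis=0)]
    return dict(zip(names, best.tolist())), medians

# Hypothetical scores: Cluster 1 to the right, Cluster 2 at the top, Cluster 3 to the left.
scores = [[2.0, 0.0], [3.0, 0.5], [0.0, 2.0], [0.5, 3.0], [-2.0, 0.0], [-3.0, -0.5]]
labels = [1, 1, 2, 2, 3, 3]
arrows = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0]]
names = ["DO_FLD", "TEMP_FLD", "COND_LAB"]
assignment, medians = assign_variables_to_clusters(scores, labels, arrows, names)
```

The median is used rather than the mean so that a single outlying neuron does not decide the cluster association.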

Figure 3-20: Canonical Correspondence Analysis plot showing the environment variables scaled by a factor of 4 and the scores of the k-means clustered SOM neurons in three different shades of gray. The first two axes account for 32% and 16% of the variation respectively. The variable-neuron cluster associations are indicated according to Table 3-8. See more details in text.

We observe that SOM and SOM combined with CCA based analysis suggest the same dominant cluster association for 32 out of the 47 environmental variables (68%). For the remaining 15 variables, 11 are associated mostly with cluster 2 (of intermediate IBI) based on CCA. It is interesting to observe, however, that 5 of them are also shown to achieve a maximal per cluster median value in cells of cluster 1, or cluster 3, bordering cluster 2 (see Table 3-8 and Figure 3-18). This suggests that the


disagreement of the two methods may be used as the basis for detecting the boundaries of neuron clusters with different IBI patterns.

The variables pertaining to stream gradient (ST_GRAD), flow (FLOW), temperature (TEMP_FLD), acreage (ACREAGE) and land use associated with deciduous forests (DECIDFOR) and woody wetlands (WOODYWET) are highly correlated with the first (horizontal) canonical axis and are important parameters as perceived by the CCA. The secondary, vertical gradient of the CCA is related to parameters explaining forest (FOREST), agriculture (AGRI) and urban (URBAN) land use, lab conductivity (COND_LAB), nitrate-nitrogen concentrations (NO3_LAB) and coal mining (COALMINE). The presence of COALMINE could be attributed to coal mining in the Appalachian Plateau, which has severely impacted many streams, especially in the North Branch Potomac region.

The direction of the environmental variable arrows indicates the inter-correlations between variables. Variables whose arrows point in the same direction are strongly positively correlated, whereas variables whose arrows point in opposite directions are negatively correlated. Arrows at an angle of 90 degrees indicate that the two variables are uncorrelated. Based on Figure 3-20, an example of positive correlation is the presence of high embeddedness (EMBEDDED) in barren lands (BARREN), possibly due to erosion and sediment input. Coal mines (COALMINE) normally have high sulphate content (SO4_LAB). On the other hand, high dissolved organic carbon levels (DOC_LAB) and the associated high biochemical oxygen demand, caused by organic enrichment, result in low oxygen concentrations (DO_FLD) in the stream substrate, indicating negative correlation. High concentrations of dissolved organic carbon (DOC) in freshwater, accessible to bacteria and other microorganisms as a source of energy, may indicate pollution by anthropogenic sources. Conditions for agricultural land (AGRI) are opposite to those for barren land (BARREN).
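The reading of arrow directions can be made concrete with the cosine of the angle between two arrows: close to +1 for arrows pointing the same way, close to -1 for opposite arrows, and 0 at 90 degrees. In a two-axis plot this cosine only approximates the correlation between the variables, to the extent that the two plotted axes capture their variation. The function name `arrow_cosine` and the coordinates below are hypothetical.

```python
import numpy as np

def arrow_cosine(a, b):
    """Cosine of the angle between two environmental variable arrows."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical arrow coordinates on the first two canonical axes.
same_direction = arrow_cosine([1.0, 0.0], [2.0, 0.0])   # strongly positive
right_angle = arrow_cosine([1.0, 0.0], [0.0, 3.0])      # uncorrelated
opposite = arrow_cosine([1.0, 1.0], [-2.0, -2.0])       # strongly negative
```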
Wetlands (WETLANDS) are normally dense and enclosed by vegetation. This is shown by the positive correlation between WETLANDS and SHADING and the negative correlation between WETLANDS and TEMP_FLD. Another relationship involving temperature is its negative relation with dissolved oxygen (DO_FLD): warmer water holds less dissolved oxygen at saturation. We present two more cases highlighting the complex interrelationships between the variables across all domains. Woody wetlands (WOODYWET) are dystrophic, which means they do not have enough nutrients and are low in dissolved oxygen (DO_FLD) and pH (PH_LAB, PH_FLD). Organic production and decomposition in woody wetlands (as in any wetland) are intensive and place great demand on dissolved oxygen (DO); in the absence of DO, organic acids are produced. The impact of low DO on fish is well known. Low pH makes pollutants such as metals more toxic. Embeddedness refers to the extent to which gravel, cobble, or boulders are surrounded or covered by fine material (silt or clay). High velocity (AVG_VEL) does not allow these fine particles to settle, resulting in a low degree of embeddedness (EMBEDDED). Based on the information in Figure 3-20, it is possible to highlight associations suggested by the data between variables and clusters. The primary (first) canonical


axis effectively segregates Clusters 1 and 3, thereby pointing out the variables that most strongly affect the fish IBI. Oxygen concentrations (DO_FLD), pH (PH_LAB, PH_FLD) and stream gradient (ST_GRAD) amount to good conditions for the overall quality of the watershed, resulting in good fish IBI in Cluster 1. All the habitat quality parameters (RIFFQUAL, CHAN_ALT, INSTRHAB, BANKSTAB, SHADING and AESTHET) have been grouped in Cluster 1, which agrees with their basic definition, whereby high values indicate pristine environments for the fish. The fish IBI has a positive relationship with agricultural land use (AGRI), even though nitrate-nitrogen concentrations were three times higher in areas with over 50% agriculture [Roth et al. 1999]. One reason could be the decrease in urban land cover as agricultural land increases. [Blankenship 2000] hypothesized that the reasons could be the increasing food supply for the fish due to nutrients in agricultural lands, or the displacement of the fish community from a nearby developed region to agricultural lands. On the other hand, high temperature (TEMP_FLD), organic carbon (DOC_LAB), embeddedness (EMBEDDED), barren land (BARREN) and woody wetland (WOODYWET) use are some of the variables which increase the degradation, causing the fish IBI to be below optimal values. Variables in Cluster 2 have a large projection on the second canonical axis. High conductance (COND_LAB, COND_FLD), coal mines (COALMINE) and pastures (PASTUR) produce conditions between the two extremes in the watershed, with the fish IBI around the average level. The results obtained using only the SOM and those obtained by combining SOM-based clustering with CA/CCA are consistent and complement each other well.

3.2.7. Discussion

The results provide a comprehensive state-based study for future monitoring that can address short-term and long-term trends. The different visualization approaches, based on a detailed analysis, give the watershed manager ample scope for working out new strategies focused on improving the overall health of the streams. The methodology aims to quantify the extent to which the environmental variables may be affecting the critical biological resources in the state. Information obtained from this research could be used to support and initiate policies for watershed restoration. One of the major highlights of this chapter is the flexibility of the SOM in enabling effective visualization of highly complex multi-dimensional ecological data sets. The SOM analysis provides an ordered set of sites across the state with similar metric characteristics, ultimately reflected in the fish IBI. The clustering analysis allows us to divide the examination of the entire state into 3 clusters, which helps us to compare the clusters and define their key ecological characteristics. The various boxplots in the figures allow us to understand the statistical distribution of the variables in the SOM. CA and CCA help us in drawing conclusions about specific fish species and the role of the environmental variables in maintaining the perfect abode for fishes. The association between the environmental variables and the clusters summarizes the gradient distribution in the clusters. In Figure 3-21, we have indicated the top 20 environment variables in order of importance as they explain the variation in the distribution of the fish species, based on the length of each arrow in Figure 3-20. Woody Wetlands, with a high collinearity with the first canonical axis (Figure 3-20), has an important role in deciding the fish composition, which will ultimately affect


the fish IBI. The negative impacts of the woody wetlands on fish have been discussed in Section 3.2.6. Embeddedness, channel alteration and stream gradient are some of the habitat variables known to have an impact on biotic integrity. Lands used for the production of row crops, which could be a surrogate for agricultural lands, are an important stressor contributing to stream degradation in Maryland. The positions of dissolved oxygen and pH in Figure 3-21 reaffirm the importance of these variables in maintaining the fish community in streams.

Figure 3-21: Comparative ranking of the environment variables based on the length of the arrow of the environment variables in Figure 3-20. The length of the arrow represents the importance of each variable in explaining the variation in fish species distribution. Only the top 20 variables have been shown. The labels are indicated in Appendix A: Table A1.
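A ranking by arrow length of the kind shown in Figure 3-21 can be sketched as follows. The function name `rank_by_arrow_length` and the arrow coordinates are hypothetical; the variable names are used only as labels.

```python
import numpy as np

def rank_by_arrow_length(arrows, names, top=20):
    """Return (name, length) pairs sorted from the longest arrow to the shortest."""
    lengths = np.linalg.norm(np.asarray(arrows, dtype=float), axis=1)
    # A stable sort on the negated lengths keeps the input order for ties.
    order = np.argsort(-lengths, kind="stable")[:top]
    return [(names[i], float(lengths[i])) for i in order]

# Hypothetical arrow coordinates on the first two canonical axes.
arrows = [[3.0, 4.0], [1.0, 0.0], [0.0, 2.0]]
names = ["WOODYWET", "PH_LAB", "ST_GRAD"]
ranking = rank_by_arrow_length(arrows, names, top=2)
```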

[Roth et al. 1999] have highlighted the need for inclusion of more stressors in the MBSS dataset especially related to nutrients to understand better the geographic scales of the problems plaguing the streams in Maryland today. The second round of MBSS (2000-2004) will address some of the questions brought to the fore, based on the analysis by [Roth et al. 1999]. We hope to include the dataset in the analysis to obtain finer conclusions and to visualize the temporal trend of the stressors affecting the streams in Maryland.


Chapter 4: Analysis of the Ohio EPA Dataset

Located west of the Appalachian Mountains and bounded by Lake Erie to the north and the Ohio River to the east and south, Ohio has more than 700 miles of navigable waterways. Eighty percent of the state’s land area drains into the Ohio River, which forms its southern boundary and for which the state was named. The rest of the streams drain into Lake Erie, the smallest of the Great Lakes in volume. The southeastern quarter of the state belongs to the Appalachian Plateau. Ohio shares parts of two major physical provinces of the continental United States: the Appalachian Plateau and the Central Lowland. The boundary between these regions cuts the state in two along a northeast-southwest line extending from southwest of Cleveland to the Ohio River. The two regions are distinguished by their relief and elevation, with higher, more rugged land in the plateau areas and less elevated, level terrain in the Lowland province. In western Ohio limestone and dolomite are widespread. Toward the east sandstones and shales are more prevalent. The northwest's field crops of corn and soybeans are typical of the agricultural economy in the Corn Belt, while southeastern Ohio has a general mixed-farming economy consisting of cattle grazing and minimal crop production.

The Ohio EPA has formulated a list of 12 metrics, modified from the ones proposed by [Karr 1981], based on the type of site: Headwater, Wading and Boat sites. Each type has its own set of metrics for calculating the fish IBI [OhioEPA 1987]. The complete list of the metrics by site type in Ohio and the associated scoring criteria is given in Appendix B (B2). For this analysis, the 12 metrics and their abbreviations are listed in Table 4-1.

| Variable | Label |
| --- | --- |
| SPSCORE | Total Species Metric Score |
| DADSRNSCORE | Darter/Sculpins Metric Score |
| SUNSCORE | Sunfish Metric Score |
| SUMINSCORE | Sucker/Minnow Metric Score |
| INTSCORE | Intolerant/Sensitive Species Metric Score |
| TOLSCORE | % Tolerant Metric Score |
| OMNISCORE | % Omnivores Metric Score |
| INSSCORE | % Insectivores Metric Score |
| TPIOSCORE | % Top Carnivore/Pioneering Metric Score |
| NUMSCORE | Fish Abundance Metric Score |
| SPWNSCORE | %/Number Simple Lithophils Metric Score |
| DELSCORE | DELT Anomalies Metric Score |

Table 4-1: Metrics used by the Ohio EPA to compute the fish IBI [OhioEPA 1987].

The formulated Ohio EPA dataset consisted of 1848 sampling sites spread over the entire state. The period of interest for this research was 1995 to 2000. The formation of this subset dataset has been explained in Section 2.4.2. The complete information regarding this dataset has been included in Appendix B.


4.1. Determining the Size of the SOM

The results presented below follow the methodology proposed in Chapter 3 and allow comparisons between the two datasets. In the case of Maryland (Section 3.2), we formed the metric input by combining the raw metric values at each sampling site. This is not possible in the case of Ohio. The metric related to darters (DADSRNSCORE) combines a count (number of darter species at Headwater and Wading sites) and a percentage (percent round-bodied suckers at Boat sites). The same is true for SPWNSCORE, which is composed of the number of simple lithophil species at Headwater sites and the percentage of simple lithophils at Wading and Boat sites. Since it would be illogical to combine a count and a percentage to represent the same metric, we used the metric scores (ranked as 1, 3, and 5) to form the input to the SOM. The 12 metric scores for each sampling site were used as input to the SOM, and the topographic and quantization errors (Figure 4-1) were used to decide on the number of map units for the SOM.

Figure 4-1: Quantization and topographic errors as a function of the SOM map units. We selected 45 = 9 X 5 as the optimal map size.

The SOM was trained for 20 epochs in the rough training phase and 100 epochs in the fine-tuning phase. The resultant SOM has a map size of 9 X 5 = 45 SOM neurons. The associated errors are 0.8515 (quantization error) and 0.02 (topographic error).
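The two quality measures used to choose the map size can be sketched as follows. The quantization error is taken here as the mean distance between each input vector and its best-matching unit, and the topographic error as the fraction of inputs whose best and second-best matching units are not neighbours on the map. The sketch assumes a rectangular lattice with four-neighbour adjacency and a row-major codebook; the lattice used for the report's SOM is not stated in this section, and the function name `som_errors` and the toy codebooks are hypothetical.

```python
import numpy as np

def som_errors(data, codebook, grid_shape):
    """Quantization and topographic error of a SOM on a rectangular lattice.

    data       : (n_sites, n_features) input vectors
    codebook   : (rows * cols, n_features) unit weights in row-major order
    grid_shape : (rows, cols) of the map
    """
    data = np.asarray(data, dtype=float)
    codebook = np.asarray(codebook, dtype=float)
    rows, cols = grid_shape
    if codebook.shape[0] != rows * cols:
        raise ValueError("codebook size does not match grid_shape")
    # Distance from every input vector to every map unit.
    dists = np.linalg.norm(data[:, None, :] - codebook[None, :, :], axis=2)
    order = np.argsort(dists, axis=1)
    bmu1, bmu2 = order[:, 0], order[:, 1]
    qe = float(dists[np.arange(len(data)), bmu1].mean())
    # Grid coordinates of the best and second-best matching units.
    r1, c1 = np.divmod(bmu1, cols)
    r2, c2 = np.divmod(bmu2, cols)
    adjacent = (np.abs(r1 - r2) + np.abs(c1 - c2)) == 1
    te = float(np.mean(~adjacent))
    return qe, te

# Toy 1 x 3 maps with one feature (hypothetical values).
ordered = [[0.0], [1.0], [2.0]]   # neighbouring units hold neighbouring values
twisted = [[0.0], [5.0], [1.0]]   # the two closest units are not adjacent
qe, te = som_errors([[0.1], [1.9]], ordered, (1, 3))
qe2, te2 = som_errors([[0.4]], twisted, (1, 3))
```

Larger maps usually lower the quantization error, so the two errors are examined together when choosing the number of map units.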

4.2. The U Matrix Representation of the SOM

The U matrix was constructed from the SOM using the Euclidean distances across the 12 fish metrics. Although no clusters are as clearly visible as in the U matrix for Maryland (Figure 3-3), there appear to be isolated cluster boundaries, one splitting the top half and one at the bottom.


Figure 4-2: Representation of the SOM U matrix.
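The idea behind the U matrix can be sketched with a simplified variant that stores, for every map unit, the average Euclidean distance between its weight vector and those of its grid neighbours. High values mark cluster boundaries. The sketch assumes a rectangular lattice with four neighbours and at least two units; the full U matrix also holds the individual between-unit distances. The function name `u_heights` and the toy codebook are hypothetical.

```python
import numpy as np

def u_heights(codebook, grid_shape):
    """Average distance from each unit's weight vector to its grid neighbours."""
    rows, cols = grid_shape
    W = np.asarray(codebook, dtype=float).reshape(rows, cols, -1)
    U = np.zeros((rows, cols))
    for r in range(rows):
        for c in range(cols):
            dists = []
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < rows and 0 <= cc < cols:
                    dists.append(np.linalg.norm(W[r, c] - W[rr, cc]))
            U[r, c] = np.mean(dists)
    return U

# Toy 1 x 4 map with one feature: two tight groups separated by a large gap.
U = u_heights([[0.0], [0.1], [5.0], [5.1]], (1, 4))
```

In this toy map the two middle units receive the largest values, which is how a boundary between two groups of similar units shows up.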

4.3. Clustering of the SOM Neurons

K-means clustering with 100 iterations, evaluated with the Davies-Bouldin index, was used to help decide on the number of clusters of cells in the SOM. Based on the U matrix (Figure 4-2) and the k-means results (Figure 4-3), we decided that the SOM should be partitioned into 3 clusters. Figure 4-4 shows the results of the SOM partitioning and the spatial distribution of each cluster in Ohio. Again, the results differ from those for Maryland (Figure 3-5). The SOM clusters do not show a similar relationship at the spatial level. Instead, the clusters are distributed across the entire Ohio state map (see Figure 4-4).
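Choosing the number of clusters with k-means and the Davies-Bouldin index can be sketched as follows; lower index values indicate compact, well-separated clusters. This is an illustration under stated assumptions rather than the procedure used in the report: the deterministic farthest-point seeding is chosen here only for reproducibility, the sketch does not handle clusters that become empty, and all function names and the toy data are hypothetical. In the setting of the text, the rows of `X` would be the SOM codebook vectors.

```python
import numpy as np

def assign(X, centers):
    """Index of the nearest center for every row of X."""
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    return d.argmin(axis=1)

def farthest_point_init(X, k):
    """Deterministic seeding (an assumption): start at X[0], then add the farthest point."""
    centers = [X[0]]
    for _ in range(k - 1):
        d = np.linalg.norm(X[:, None, :] - np.array(centers)[None, :, :], axis=2)
        centers.append(X[np.argmax(d.min(axis=1))])
    return np.array(centers)

def kmeans(X, k, n_iter=100):
    """Plain Lloyd iterations, stopping early once the centers no longer move."""
    centers = farthest_point_init(X, k)
    for _ in range(n_iter):
        labels = assign(X, centers)
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return assign(X, centers), centers

def davies_bouldin(X, labels, centers):
    """Mean over clusters of the worst (scatter_i + scatter_j) / separation_ij ratio."""
    k = len(centers)
    scatter = np.array([np.linalg.norm(X[labels == j] - centers[j], axis=1).mean()
                        for j in range(k)])
    sep = np.linalg.norm(centers[:, None, :] - centers[None, :, :], axis=2)
    worst = [max((scatter[i] + scatter[j]) / sep[i, j] for j in range(k) if j != i)
             for i in range(k)]
    return float(np.mean(worst))

def best_k(X, candidates):
    """Davies-Bouldin index for each candidate k, and the k with the lowest index."""
    scores = {}
    for k in candidates:
        labels, centers = kmeans(X, k)
        scores[k] = davies_bouldin(X, labels, centers)
    return min(scores, key=scores.get), scores

# Toy data (hypothetical): three tight groups of four points each.
offsets = np.array([[-0.5, -0.5], [-0.5, 0.5], [0.5, -0.5], [0.5, 0.5]])
blobs = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([b + offsets for b in blobs])
k_opt, scores = best_k(X, [2, 3, 4])
```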


Figure 4-3: The Davies-Bouldin index used in conjunction with the k-means algorithm to find the optimal number of clusters in the SOM. We selected 3 clusters.

From Figure 4-4, it can be interpreted that the sampling sites around the northern border are dominated by Cluster 3.


Figure 4-4: k-means clustering of the SOM neurons and spatial distribution of sites in each cluster for Ohio. The bottom right panel shows the result of the k-means clustering of the SOM neurons after 100 iterations. The clusters are distributed across the entire state.

Ohio contains portions of five ecoregions: Huron-Lake Erie Plain, Erie-Ontario Lake Plain, Eastern Corn Belt Plain, Western Allegheny Plateau, and Interior Plateau. The landforms, soils, and land use in Ohio’s five ecoregions are quite varied, ranging from the forest-covered hills (Ohio Hills) in the southeast to the level, heavily agricultural west (Till Plains). Cluster 3 has a high concentration of the sites in the Huron/Erie Lake Plains (57%). Cluster 1 has a sizeable share of the sites in the Eastern Corn Belt Plains (51%), Western Allegheny Plateau (61%) and the Interior Plateau (50%). The sites within the Erie-Ontario Lake Plain ecoregion are distributed across the 3 clusters.

| Ecoregion | Cluster 1 | Cluster 2 | Cluster 3 |
| --- | --- | --- | --- |
| 55: Eastern Corn Belt Plains | 50.83% | 27.01% | 22.17% |
| 57: Huron/Erie Lake Plains | 14.78% | 27.83% | 57.39% |
| 61: Erie-Ontario Lake Plain | 43.55% | 35.23% | 21.23% |
| 70: Western Allegheny Plateau | 60.87% | 25.76% | 13.378% |
| 71: Interior Plateau | 49.61% | 37.21% | 13.18% |

Table 4-2: Distribution of ecoregions in each cluster in Ohio. The percentages indicate the proportion of the sites in each ecoregion for each cluster.
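Percentages of this kind, where each ecoregion's sites are split across the clusters and each row sums to 100%, can be computed as sketched below. The function name `ecoregion_cluster_percentages` and the six-site toy data are hypothetical.

```python
import numpy as np

def ecoregion_cluster_percentages(ecoregions, clusters):
    """Share of each ecoregion's sites that falls in each cluster (rows sum to 100)."""
    ecoregions = np.asarray(ecoregions)
    clusters = np.asarray(clusters)
    regions = np.unique(ecoregions)
    ks = np.unique(clusters)
    # Contingency table: one row per ecoregion, one column per cluster.
    counts = np.array([[np.sum((ecoregions == r) & (clusters == k)) for k in ks]
                       for r in regions], dtype=float)
    pct = 100.0 * counts / counts.sum(axis=1, keepdims=True)
    return regions, ks, pct

# Toy data (hypothetical): four sites in ecoregion 55 and two in ecoregion 57.
site_ecoregion = [55, 55, 55, 55, 57, 57]
site_cluster = [1, 1, 2, 3, 3, 3]
regions, ks, pct = ecoregion_cluster_percentages(site_ecoregion, site_cluster)
```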


Figure 4-5: Distribution of the 5 Level III Ecoregions in Ohio. The y axis represents the total sites in Ohio for each category. The barplots have been arranged from left to right, designated as Overall (all the 1848 sampling sites in Ohio), Cluster 1, Cluster 2 and Cluster 3. The number in front of each legend entry indicates the corresponding Ecoregion number.

4.4. Relationships between Biological and Environmental Variables

4.4.1. Visualization of Fish Metric Gradients on the SOM

The Ohio EPA proposed a modified set of 12 metrics based on those originally advocated in [Karr 1981]. A rating of 5, 3 or 1 is assigned to each metric according to whether its value approximates (5), deviates somewhat from (3) or strongly deviates from (1) the value expected at a reference site where human influence has been minimal (see Appendix B: Table B2). Darter species are known to be insectivorous habitat specialists that are sensitive to physical and chemical environmental disturbances [OhioEPA 1987]. These factors make the darters reliable indicators of good water quality and habitat conditions. The metric related to sunfish (SUNSCORE) is primarily a measure of the degradation of pool habitats. Suckers represent a major component of the Ohio fish community and, given their relatively long life spans, the metric (SUMINSCORE) assesses past and present environmental conditions. DELSCORE deals with the incidence of DELT (Deformities, Eroded fins, Lesions and Tumors) anomalies in fish communities, an indication of stress and environmental degradation. As stated above, the rank of the metrics indicates the degree of degradation of the biota in running streams. In Figure 4-6, we see that almost all the metric scores have a low to high gradient distribution as we move from the top to the bottom half of the SOM.

Figure 4-6: 12 Component planes for the fish metric scores visualized on the SOM (left) and the corresponding boxplots (right) for the individual cluster distributions. There is a clear gradient distribution in most of the metric components. The ranges for the metric scores are shown in the corresponding colorbar.

Figure 4-6 indicates a severely degraded condition for the fish biota in the upper rows of the SOM. This is confirmed by the distribution of the fish IBI in the next section.

4.4.2. Visualization of the Fish IBI

The Ohio EPA has formulated the fish IBI as the sum of the individual fish metric scores, giving a range of 12 to 60. The fish IBI for a SOM neuron is calculated in the same manner as in the analysis of the MBSS dataset (Equation (3-5)). The distribution of the fish IBI and the corresponding boxplot are given in Figure 4-7. Each cluster is associated with IBI values within a particular range. This is made clearer by noting the means and the standard deviations within each cluster: Cluster 1 (44.11 ± 4.03), Cluster 2 (35.11 ± 3.27) and Cluster 3 (24.45 ± 4.19). This is advantageous, as an analysis of the clusters corresponds directly to an effect on the fish IBI.


Figure 4-7: Distribution of the fish IBI on the SOM (left) and the corresponding boxplots (right) for the individual clusters. The different sizes of the neurons indicate the associated cluster for the neuron, whereby the largest (smallest) size indicates Cluster 1 (Cluster 3). Overall indicates that the data represents all the SOM neurons which cover data from the entire state. The low values of the Fish IBI are concentrated in the top right area of the SOM (Cluster 3).

The biological criteria adopted by the Ohio EPA are integrated with the system of use designations employed in the Ohio Water Quality Standards [Yoder and Rankin 1998]. The aquatic life use designations are assigned to individual waterbody segments based upon the potential to support that use according to narrative and numerical criteria. The narrative levels of the biological criteria correspond to the tiered layout (Figure 4-8) comprising EWH (Exceptional Warmwater Habitat), WWH (Warmwater Habitat), MWH (Modified Warmwater Habitat) and LRW (Limited Resource Water).


Figure 4-8: Relationship of biological integrity to the quantitative biological criteria and the habitat uses in the Ohio Water Quality Standards. Source: [Yoder and Rankin 1998].

Ohio EPA has stipulated the ranges of the fish IBI for the habitat labels: >48 (EWH), 32-44 (WWH), 22-30 (Impounded MWH) and 20-24 (Channel modification MWH), although the actual values differ based on Ohio’s five ecoregions. Modified Warmwater Habitat represents extensively modified habitats that are capable of supporting the semblance of a warmwater biological community, but fall short of attaining WWH because of functional and structural deficiencies, primarily due to altered macrohabitat. From Figure 4-7, we can conclude that EWH is completely contained in Cluster 1, while MWH (Impounded and Channel modification) is limited to Cluster 3.


Figure 4-9: Distribution of the fish IBI in Ohio. The averaged fish IBI values on the SOM are duplicated for the sampling sites falling in the same SOM neuron and reproduced on the map.

Regions with poor fish IBI are concentrated in the western part of Ohio, particularly around Toledo and along the Wabash River at the western state border (Figure 4-9). Almost all of the small streams in the Toledo area have been channel modified to some degree [Yoder et al. 2000]. The Wabash River watershed was designated as Ohio’s worst watershed by the Ohio EPA in 1999. Lack of buffer zones, excessive nutrients and high bacteria levels were cited as some of the reasons for the poor conditions. The basins around Lake Erie, especially around Cleveland in Cuyahoga County, are also degraded. Figure 4-10 compares the SOM averaged fish IBI with the observed IBI at all the 1848 sampling sites. We see a transition from high values to low as the SOM cell numbers increase (bottom to top SOM traversal).


Figure 4-10: Comparing the actual IBI values with the SOM averaged fish IBI in Ohio. The Mean Square Error (MSE) is calculated as 17.4142. Each frame represents a SOM cell neuron with the label indicated (in steps of 5) at the bottom right of the frame. While the scatter points in each frame represent the observed fish IBI of the sites falling in that SOM neuron, the horizontal line in the frame represents the averaged IBI value of those sites.

4.4.3. Visualization of Invertebrate Community Index (ICI) and Qualitative Habitat Evaluation Index (QHEI)

The Ohio EPA measures the overall health of the macroinvertebrate community through the Invertebrate Community Index (ICI), modified from the Index of Biotic Integrity (IBI). The ICI consists of 10 structural and functional metrics that are scored 0, 2, 4, or 6 depending on how closely the results approximate least disturbed reference conditions. A score of 6 approaches the highest quality community conditions. Summation of the individual metric scores yields an ICI value between 0 and 60. The ICI is patterned on the SOM and the distribution is noted (Figure 4-11). A significant positive correlation exists between the fish IBI and the ICI (ρ = 0.8210).


Figure 4-11: Distribution of the ICI on the SOM (left) and the corresponding boxplots (right) for the individual clusters. The different sizes of the neurons indicate the associated cluster for the neuron, whereby the largest (smallest) size indicates Cluster 1 (Cluster 3). Overall indicates that the data represents all the SOM neurons which cover data from the entire state.

The Qualitative Habitat Evaluation Index (QHEI) is a benchmark employed by the Ohio EPA for relating the stream expectations to the habitat quality. The six metrics which sum up to form the QHEI are important attributes, which effectively quantify the physical habitat. A complete interpretation of the metrics and the index is given in [Rankin 1989]. Metrics for the QHEI include substrate, in-stream cover, channel morphology, riparian zone and bank erosion, pool/glide and riffle/run quality, and gradient. The highest possible score is 100. Streams that score above 60 are usually designated as warmwater habitat.


Figure 4-12: Distribution of the QHEI on the SOM (left) and the corresponding boxplots (right) for the individual clusters. The different sizes of the neurons indicate the associated cluster for the neuron, whereby the largest (smallest) size indicates Cluster 1 (Cluster 3). Overall indicates that the data represents all the SOM neurons which cover data from the entire state.

The results of patterning the QHEI on the SOM and the resultant boxplots are shown in Figure 4-12. The QHEI is strongly and positively correlated with the fish IBI (ρ = 0.8796).

4.4.4. Visualization of Environmental Variable Gradients on the SOM

As before, we mapped the environmental variables on the SOM according to equation (3-6). Temperature is uniform across all three clusters. pH also displays a consistent distribution, with values between 7.6 and 8 (slightly alkaline) irrespective of the cluster. Hardness, Calcium and Magnesium have similar gradients across the SOM, which agrees with the definition of hardness as caused by various dissolved salts of calcium and magnesium. When the biochemical oxygen demand (BOD) levels are high, dissolved oxygen (DO) levels decrease because the oxygen available in the water is consumed by bacteria. This is reflected in the negative correlation between BOD and DO.


Figure 4-13: Distribution of the Water Chemistry variables (I) on the SOM (left) and the corresponding boxplots (right) for the individual clusters. Overall indicates that the data represents all the SOM neurons which cover data from the entire state. The labels are indicated in Appendix B: Table B 1

Except for isolated SOM neurons (Figures 4-13 and 4-14), Ammonia, Nitrite, Cadmium and Zinc show no marked variation across the SOM. Arsenic and Iron are elevated in the SOM cells with poor IBI.


Figure 4-14: Distribution of the Water Chemistry variables (II) on the SOM (left) and the corresponding boxplots (right) for the individual clusters. Overall indicates that the data represents all the SOM neurons which cover data from the entire state. The labels are indicated in Appendix B: Table B 1

For the habitat variables, we used the habitat metrics for the patterning. We included Embeddedness as a separate variable, even though it is included in the Substrate metric. All the habitat variables show an extreme gradient distribution across the SOM, signifying different habitat characteristics across the clusters. From Figure 4-15, we conclude that Embeddedness has an opposite gradient to all the Habitat Metrics. As in the MBSS dataset, the habitat is severely degraded in the upper rows of the SOM.


Figure 4-15: Distribution of Physical Habitat and Land Use (Landscape) variables on the SOM (left) and the corresponding boxplots (right) for the individual clusters. Overall indicates that the data represents all the SOM neurons which cover data from the entire state. The labels are indicated in Appendix B: Table B 1

4.4.5. Exploring the Relationships between the Environment Variables, Fish Metrics and the Fish IBI

The correlation matrix of all the variables over the 45 SOM neurons was calculated (Figure 4-16). All the metric scores except the one related to DELT anomalies (DELT) are strongly related to the habitat parameters. Forest cover (PER_FORWET) has a positive effect on the indices; agricultural (PER_AG) and urban (PER_URBDEV) land uses are negatively correlated with them, although the correlation is not high.
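A correlation matrix over SOM neurons can be sketched as below. The sketch assumes the Pearson coefficient (the report does not state which correlation measure is used) and that each variable is represented by one value per neuron; the variable values are hypothetical toy data over four neurons rather than the 45 used in the report.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length sequences.

    Raises ZeroDivisionError if either sequence is constant.
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def correlation_matrix(columns):
    """columns: {variable name: [one value per SOM neuron]}."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}

# Hypothetical per-neuron values for three variables.
columns = {
    "IBI": [50, 40, 30, 20],
    "QHEI": [80, 70, 50, 40],
    "EMBEDDED": [1, 2, 3, 4],
}
corr = correlation_matrix(columns)
```

In this toy example the habitat index rises with the IBI while embeddedness falls, which mirrors the sign pattern described for Figure 4-16.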


Figure 4-16: Correlation matrix indicating the correlation between the environmental variables and the fish metrics and indices of integrity. The gray scale indicates the absolute value of the correlation, while the sign of the correlation is indicated in associated block.

The per-cluster median value for each of the three clusters was noted for all the fish metrics and the environmental variables. The metric scores (Table 4-3) and the environmental variables (Table 4-4) were assigned to the cluster in which the maximum per-cluster median value was achieved (refer to Appendix B: Table B 4 and Table B 5 for the actual calculations). Similar to the analysis of the MBSS dataset, the maximum/minimum per-cluster values are mapped on the SOM (Figure 4-17) to get a snapshot of the complex relationships between the metrics and the environmental variables and to summarize the individual distributions on the SOM.
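The assignment rule used for Tables 4-3 and 4-4 can be sketched as follows: compute the median of a variable over the neurons of each cluster and pick the cluster with the largest median. The function name and the toy values are hypothetical, and ties are resolved arbitrarily here, which the report does not discuss.

```python
from statistics import median

def cluster_of_max_median(neuron_values, neuron_cluster):
    """Return (cluster with the largest median, {cluster: median})."""
    by_cluster = {}
    for value, cluster in zip(neuron_values, neuron_cluster):
        by_cluster.setdefault(cluster, []).append(value)
    medians = {c: median(values) for c, values in by_cluster.items()}
    best = max(medians, key=medians.get)
    return best, medians

# Hypothetical SOM with seven neurons split into three clusters.
neuron_cluster = [1, 1, 1, 2, 2, 3, 3]
do_values = [9, 8, 8.5, 7, 6.5, 5, 4]      # dissolved oxygen per neuron
embedded_values = [1, 2, 1, 2, 3, 4, 4]    # embeddedness per neuron

do_cluster, do_medians = cluster_of_max_median(do_values, neuron_cluster)
emb_cluster, emb_medians = cluster_of_max_median(embedded_values, neuron_cluster)
```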


Cluster 1

Cluster 2

Cluster 3

SPSCORE DADSRNSCORE SUNSCORE SUMINSCORE INTSCORE TOLSCORE OMNISCORE INSSCORE TPIOSCORE NUMSCORE SPWNSCORE DELSCORE

Table 4-3: The fish metrics are listed in the cluster column in which their per-cluster median value is maximized.

Cluster 1

Cluster 2

Cluster 3

DO

PHOSPHORUS

TEMPERATURE

PH

CD

CONDUCTIVITY

CU

PB

BOD TSS AMMONIA NITRITE TKN NITRATE HARDNESS

Water Chemistry

CALCIUM MAGNESIUM CHLORIDE SULPHATE ARSENIC CD CU IRON ZN SUBSTRATE

EMBEDDED

COVER CHANNEL

Physical Habitat

RIPARIAN POOL RIFFLE GRADIENT_S

Land Use

PER_FORWET

PER_URBDEV

PER_AG

Table 4-4: The environmental variables listed under the cluster column in which their per cluster median value is maximized. The labels are indicated in Appendix B: Table B 1. The results have been summarized from Appendix B: Table B 5


Overlaying Figure 4-17 on Figure 4-7 helps us understand the finer details and the effect of each environmental variable in maintaining biological integrity, as expressed through the fish IBI. In the case of the metrics, this is merely a manifestation of their basic definition, whereby high metric scores result in a high fish IBI and vice versa. The regions with poor IBI values, corresponding to the top rows of the SOM, exhibit elevated concentrations of dissolved metals (Copper, Iron, Arsenic, etc.). These regions are characterized by high conductivity, embeddedness and agricultural land use. Regions with good habitat quality, expressed through the habitat metrics, have good fish IBI. This indicates a positive correlation between physical and biological integrity.

Figure 4-17: Visualizing the relationship between the 12 Fish Metrics (top row) and the environmental variables (bottom row). Each variable and metric is plotted in the cluster where it exhibits maximal/minimal per cluster median value. Furthermore its name is listed within the neuron where it achieves the maximum/minimum in the designated cluster.


4.5. Combining SOM Patterning with Correspondence Analysis

Before applying Correspondence Analysis, the fish species were patterned on the SOM through the labeled sampling sites. The original Ohio EPA dataset consisted of 151 fish species collected across the 1848 sampling sites in Ohio. After the patterning, 81 species were removed from the analysis due to zero counts in all the SOM neurons. Finally, CA was applied to the fish species data, with 70 fish species (Figure 4-18). The SOM clusters are superimposed on the CA ordination plot. Each fish species was associated with the closest SOM neuron (based on the Euclidean distance) and assigned the cluster label of that closest SOM cell. The results are summarized in Table 4-5.
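The two bookkeeping steps around the CA can be sketched as follows: dropping species with zero counts in every SOM neuron, and labeling each species with the cluster of its nearest SOM neuron in the ordination plane. The CA itself is not implemented here; the coordinates are hypothetical stand-ins for CA scores, and "ABSENT" is a made-up species label.

```python
from math import dist  # Euclidean distance, Python 3.8+

def drop_empty_species(counts):
    """Keep only species with a nonzero count in at least one neuron."""
    return {sp: c for sp, c in counts.items() if any(v > 0 for v in c)}

def assign_species_to_clusters(species_xy, neuron_xy, neuron_cluster):
    """Label each species with the cluster of its closest SOM neuron."""
    labels = {}
    for species, xy in species_xy.items():
        nearest = min(neuron_xy, key=lambda n: dist(xy, neuron_xy[n]))
        labels[species] = neuron_cluster[nearest]
    return labels

# Hypothetical counts per neuron (three neurons) for three species.
counts = {"CREKCHUB": [3, 0, 0], "GLDNSHNR": [0, 0, 2], "ABSENT": [0, 0, 0]}
kept = drop_empty_species(counts)

# Hypothetical coordinates on the first two ordination axes.
species_xy = {"CREKCHUB": (0.9, 0.1), "GLDNSHNR": (-1.0, 0.4)}
neuron_xy = {1: (1.0, 0.0), 2: (0.0, 0.0), 3: (-1.0, 0.5)}
neuron_cluster = {1: 1, 2: 2, 3: 3}
labels = assign_species_to_clusters(species_xy, neuron_xy, neuron_cluster)
```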

Figure 4-18: Correspondence Analysis bi-plot showing association of fish species to clusters of SOM neurons obtained via k-means clustering. The first two axes account for 25% and 17% of the variation respectively. SOM cell numbers are shown with different colors according to the cluster they belong to, as shown on the colorbar, while the fish species are indicated by the cluster ticks, as shown in the legend. The fish species names have been omitted for clarity but are listed in Table 4-5.

We compared the results for the Ohio dataset (Table 4-5) with those for the MBSS dataset (Table 3-6). The two tables have 20 species in common, of which 11 were found in the correct IBI-defined cluster in both tables. In particular, Blacknose Dace, Creek Chub, Central Stoneroller, Common Shiner, Rock Bass, etc. were found in Cluster 1 (contributing to good fish IBI) of both tables. Golden Shiner was found in Cluster 3 (contributing to poor fish IBI).


Cluster 1
BLACKNOSE DACE (BKNODACE)
CREEK CHUB (CREKCHUB)
CENTRAL STONEROLLER (CENSTROL)
WHITE SUCKER (WHTSUCKR)
FANTAIL DARTER (FANTDART)
COMMON SHINER (COMSHINR)
BLUNTNOSE MINNOW (BLUNMINN)
REDSIDE DACE (REDDACE)
STRIPED SHINER (STPSHIN)
RAINBOW DARTER (RNBOWDRT)
LARGEMOUTH BASS (LGMTHBAS)
ROSEFIN SHINER (ROSYSHIN)
GREENSIDE DARTER (GRNDARTR)
BLACKSIDE DARTER (BKSDACE)
BLUEGILL SUNFISH (BLSUNF)
PUMPKINSEED SUNFISH (PMPSUNF)
SAND SHINER (SNDSH)
BANDED DARTER (BNDART)
ROCK BASS (ROCKBASS)
NORTHERN HOG SUCKER (NHOGSUKR)
SILVER SHINER (SLVSH)
LOGPERCH (LOGPERCH)
TROUT-PERCH (TRTPCH)
SPOTFIN SHINER (SPFNSHIN)
RIVER CHUB (RVRCHB)
ROSYFACE SHINER (RSYSH)
GOLDEN REDHORSE (GLDNREDH)
SMALLMOUTH BASS (SMMTHBAS)
SILVER REDHORSE (SLVREDH)
MIMIC SHINER (MMCSH)
BLACK REDHORSE (BLKRHOR)
BROOK SILVERSIDE (BRKSILV)
STEELCOLOR SHINER (SLCSH)
BIGEYE SHINER (BEYSH)
STONECAT MADTOM (STNMAD)

Cluster 2
CENTRAL MUDMINNOW (CNTMINN)
ORANGETHROAT DARTER (ORNGDART)
SILVERJAW MINNOW (SJAWMINW)
HORNYHEAD CHUB (HRNYCHB)
JOHNNY DARTER (JOHNDART)
MOTTLED SCULPIN (MTLSCULP)
GOLDFISH (GOLDFISH)
GREEN SUNFISH (GRSUNFSH)
YELLOW BULLHEAD (YLLWBULH)
SUCKERMOUTH MINNOW (SCKMNOW)
REDFIN SHINER (REDSH)
COMMON CARP (COMMCARP)
QUILLBACK CARPSUCKER (QUILCRPS)
ORANGESPOTTED SUNFISH (ORNGSUNF)
LONGEAR SUNFISH (LNGSUNF)
SPOTTED SUCKER (SPTSCK)
RIVER CARPSUCKER (RVRCRPSK)
WARMOUTH SUNFISH (WARSUNF)
GRASS PICKEREL (GRSPCK)
CHANNEL CATFISH (CHCATFIS)
GIZZARD SHAD (GIZZSHAD)
FRESHWATER DRUM (FRWTDM)
EMERALD SHINER (EMERDART)
SHORTHEAD REDHORSE (SHRTREDH)
SPOTTED BASS (SPTBASS)
WHITE CRAPPIE (WHITCRAP)
SAUGER X WALLEYE (SAUWALL)
DUSKY DARTER (DSKDART)
BLACK CRAPPIE (BLKCRAPI)
BULLHEAD MINNOW (BHMINN)
SMALLMOUTH BUFFALO (SMLBUFF)

Cluster 3
SOUTH. REDBELLY DACE (STREDDAC)
BLACKSTRIPE TOPMINNOW (BKSTRMNW)
FATHEAD MINNOW (FATHMINW)
GOLDEN SHINER (GLDNSHNR)

Table 4-5: Fish species associated with SOM neuron clusters. Row wise arrangement of species in each cluster section corresponds to traversing the corresponding ticks from right to left, top to bottom in Figure 4-18.

                         Cluster 1   Cluster 2   Cluster 3
Darters                      18          10           0
Round-bodied Suckers         12           7           0
Sunfish Species               9          20           0
Headwater Species             9           4          25
Sucker Species               15          17           0
Minnow Species               18          10          50
Intolerant Species           26           4           0
Tolerant Species             12          17          50
Omnivores                     6          20          25
Insectivorous Species        83          62          50
Top Carnivores                9           7           0
Pioneering Species            3          13          25
Simple Lithophils            52          20          25

Table 4-6: Percentage of fish species having a certain ecological characteristic (Appendix B: Table B 3) contributing to one of the 12 fish metrics, computed after CA and clustering of the fish species based on Table 4-5.


Fathead Minnow and Golden Shiner in Cluster 3 are tolerant species, which tend to dominate the community as water and habitat quality decrease. A summary of the fish attributes as they contribute to the assessment of the fish metrics (Table 4-1) is shown in Table 4-6. The high percentages of tolerant and omnivorous species in Cluster 3 compared to the other two clusters support the labeling of Cluster 3 as the one with severely impaired biological integrity.

4.6. Canonical Correspondence Analysis

The associations between the fish species and the environmental variables patterned on the SOM were used in the CCA. As noted in the previous section, 81 species had to be removed because they had zero counts in all the SOM neurons. CALCIUM and MAGNESIUM are highly correlated (ρ > 0.95) with HARDNESS and were also removed from the analysis. CCA was applied to 45 SOM neurons with 70 fish species and 31 environmental variables (Figure 4-19). To explore the relationships between the environmental variables and the SOM neuron sites, the SOM neuron scores were projected on the environmental variable arrows. Based on the cluster label of each SOM neuron, the median projection for each cluster was noted. Table 4-7 segregates the environmental variables into clusters based on the maximum median projection (refer to Appendix B: Table B 5). The bold text in Table 4-7 indicates the agreement of the SOM and the CCA in deciding the clusters for the environmental variables. 25 out of 31 variables (81%) are placed in the same cluster by both techniques.

Conductivity (CONDUCTIVITY) indicates the amount of ions (electrically charged particles) present in the water. In this dataset, hardness (HARDNESS) is closely related to the conductivity. High sulfate values (SULPHATE) and extremely high conductivity indicate the effects of mining disturbance. Nitrogen compounds such as Nitrate, Nitrite, and Total Kjeldahl Nitrogen (TKN) may come from any organic source such as leaves and other plant materials, or from inorganic sources such as fertilizers. The major nitrogen sources include municipal and industrial wastewater, fertilizers, animal wastes, etc. Nitrogen compounds along with phosphorus are major nutrients for aquatic plants. Excessive amounts of these nutrients can cause algal blooms leading to eutrophication, which produces more decaying matter for bacteria. The bacteria use up most of the dissolved oxygen (DO), which in turn causes fish kills.

Good riparian zones (RIPARIAN) and the presence of tree canopy (COVER) impede the flow of phosphorus (PHOSPHORUS) into the streams, with maximum phosphorus retention achieved by wooded riparian buffers [OhioEPA 1999]. In the absence of riparian zones, phosphorus enters the stream attached to suspended sediments. [Novotny et al. 2003] have highlighted the problems of snowmelt runoff in urban areas, which could have a severe impact in the glaciated regions of northern Ohio. The salt laden snowmelt from urban areas (PER_URBDEV) causes transient high salinity and chloride (CHLORIDE) concentrations in urban streams, ultimately affecting the conductivity (CONDUCTIVITY) as well.
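The projection step used to build Table 4-7 can be sketched as follows. It assumes that "projecting the SOM neuron scores on the environmental variable arrows" means the scalar projection of each neuron's biplot coordinates onto the arrow direction; the CCA itself is not computed here, and the coordinates and function names are hypothetical.

```python
from math import hypot
from statistics import median

def projection(point, arrow):
    """Signed length of `point` projected onto the direction of `arrow`."""
    dot = point[0] * arrow[0] + point[1] * arrow[1]
    return dot / hypot(arrow[0], arrow[1])

def cluster_by_median_projection(arrow, neuron_scores, neuron_cluster):
    """Return (cluster with the largest median projection, {cluster: median})."""
    by_cluster = {}
    for neuron, xy in neuron_scores.items():
        by_cluster.setdefault(neuron_cluster[neuron], []).append(projection(xy, arrow))
    medians = {c: median(values) for c, values in by_cluster.items()}
    return max(medians, key=medians.get), medians

# Hypothetical neuron scores on the first two CCA axes and their clusters.
neuron_scores = {1: (1.0, 0.2), 2: (0.8, -0.1), 3: (0.0, 0.1),
                 4: (-0.9, 0.0), 5: (-1.1, 0.3)}
neuron_cluster = {1: 1, 2: 1, 3: 2, 4: 3, 5: 3}

# Hypothetical arrows: one pointing toward Cluster 1, one toward Cluster 3.
do_cluster, _ = cluster_by_median_projection((3.0, 0.0), neuron_scores, neuron_cluster)
emb_cluster, _ = cluster_by_median_projection((-2.0, 0.0), neuron_scores, neuron_cluster)
```

A variable is thus associated with the cluster whose neurons lie, in the median, furthest along its arrow.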


Cluster 1

Cluster 2

DO

TEMPERATURE(3,2)

Cluster 3 CONDUCTIVITY

PB(2,3)

TSS(3,2)

BOD PH(1) AMMONIA NITRITE TKN NITRATE PHOSPHORUS(2,1)

Water Chemistry

HARDNESS CHLORIDE SULPHATE ARSENIC CD CU(1) IRON ZN

Physical Habitat

SUBSTRATE

EMBEDDED

COVER CHANNEL RIPARIAN POOL RIFFLE GRADIENT_S

Land Use

PER_FORWET

PER_AG PER_URBDEV(2)

Table 4-7: Association of environmental variables with k-means based SOM neuron clusters. Each variable is listed in the column corresponding to the cluster with the largest median projection on that variable’s line. Bold text highlights variables for which both methods (SOM alone and the SOM combined with CCA) confirm association with a cluster. For the other variables, the numbers in parenthesis indicate the SOM assigned cluster association (Table 4-4). Furthermore, if a variable achieves maximum per cluster median value in a border neuron, the neighboring cluster number is also noted.

An effective segregation of the environmental gradients in Clusters 1 and 3 (occupying either end of the fish IBI range in Figure 4-7) is obtained in Figure 4-19, which leads us to analyze the environmental variables as having either a positive or a negative impact on the fish IBI. This is in stark contrast to the corresponding plot for Maryland (Figure 3-20). The habitat gradients lie diagonally opposite the nutrients and metals, which indicates a negative correlation between the two domains. All the habitat metrics, dissolved oxygen and forest land cover have a positive impact on the fish IBI, as noted from their position in Cluster 1 in Figure 4-19. Fish respond negatively to metal concentrations in water, which is evident from the presence of zinc, iron, copper, sulphate, etc. in Cluster 3. Based on the length of its arrow in Cluster 3, embeddedness (EMBEDDED) has a strong negative influence on the fish IBI. Agricultural land use (PER_AG) and urban development (PER_URBDEV) reduce the forest cover (PER_FORWET), which is a vital factor for the sustainability of the aquatic biota.


Figure 4-19: Canonical Correspondence Analysis plot showing the environment variables scaled by a factor of 4 and the scores of the k-means clustered SOM neurons in three different shades of gray. The first two axes account for 28% and 19% of the variation respectively. The variable-neuron cluster associations are indicated according to Table 4-7.

4.8. Discussion

The results obtained in the analysis of the Ohio EPA dataset allow us to associate clusters with specific ranges of the fish IBI. The CCA results provide a quantitative analysis of the effects of the environmental variables on the fish biota. In the case of Ohio, we obtain an isolated set of environmental variables affecting the clusters, which simplifies our analysis to a great extent. Embeddedness was found to be an important variable affecting the fish in Ohio; hence work on improving the watershed should revolve around analyzing the effects of embeddedness on the species composition. Streams with highly altered flow regimes often become wide, shallow and homogeneous, resulting in poor habitat for many fish species. Besides coping with any low levels of dissolved oxygen, a fish needs to constantly regulate the composition of its body fluids with respect to its immediate environment through osmosis. Hardness of the water is one of the important factors involved in osmoregulation. The body fluids of a freshwater fish contain more dissolved salts and ions than the surrounding water. As a result of this imbalance, there is a constant influx of water into its body and a loss of salts and ions from the blood outwards.


Figure 4-20: Comparative ranking of the environment variables based on the length of the arrow of the environment variables in Figure 4-19. The length of the arrow represents the importance of each variable in explaining the variation in fish species distribution. Only the top 20 variables have been shown. The labels are indicated in Appendix B: Table B 1.
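The ranking in Figure 4-20 can be sketched as a sort by arrow length. The arrow coordinates below are hypothetical, and the arrow length is assumed to be the Euclidean length on the two plotted CCA axes.

```python
from math import hypot

def rank_by_arrow_length(arrows, top=20):
    """Return up to `top` (name, length) pairs, longest arrow first."""
    lengths = {name: hypot(x, y) for name, (x, y) in arrows.items()}
    ranked = sorted(lengths.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top]

# Hypothetical arrow tips on the first two CCA axes.
arrows = {"DO": (0.6, 0.8), "EMBEDDED": (-3.0, 4.0), "AMMONIA": (0.1, 0.0)}
ranking = rank_by_arrow_length(arrows)
```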

In its list of the leading causes of aquatic life impairment, [OhioEPA 1998] noted that ammonia had dropped from the second leading cause in 1988 to ninth, as a result of the construction of new sewage treatment plants throughout Ohio in the 1980s. We note the position of ammonia in Figure 4-20 as a reaffirmation of this fact. Landscape alteration, fueled by continuous population growth, is a critical stressor that constantly impacts the watersheds and needs to be monitored continuously to support the objectives of the Clean Water Act. By integrating evaluations of QHEI and IBI, managers can gain a combined perspective of both the physical and biological settings in the state. This comprehensive assessment is significant for evaluating disturbance and land use practices. Habitat and riparian restoration in the headwater streams would ameliorate the poor condition of most Lake Erie river mouths, harbors and nearshore areas [OhioEPA 1999]. Apart from providing a substantial knowledge base on the interrelationships between the various variables, the findings agree closely with those of the Ohio EPA, which provides a validation for this methodology.


Chapter 5: Conclusions and Future Research

The research done as part of this report offers a concise view of the prevailing conditions across an entire state, with SOM based visualization techniques highlighting all the parallel relationships in a novel manner, in contrast to the traditional approach of box and whisker plots and regression analysis. The results indicate the efficiency of the SOM in visualizing the state of streams in Maryland and Ohio and can aid the watershed manager in making and implementing decisions which will ultimately lead to restoration of the degraded watersheds.

In the case of Maryland, we found that the coastal areas around the Chesapeake Bay are the most degraded regions. Habitat parameters surpassed chemical parameters as important variables related to species composition, which in turn determines the fish IBI. To validate the approach with a different dataset, the same methodology was applied to a subset of the Ohio EPA dataset. The Lake Erie region and the northwestern region of Ohio were found to have poor fish IBI compared to the rest of the state. Compared to Maryland, we found that the clusters for Ohio were clearly defined and isolated in terms of the IBI ranges, which made the interpretation much easier. Embeddedness was found to be an important variable in both datasets; hence work on improving the watershed should revolve around analyzing the effects of embeddedness on the species composition.

The research confirms the growing realization in management policies that habitat should be a major consideration in developing nonpoint source strategies where the objective is to restore and protect beneficial aquatic life uses. Analyzing the chemical point sources in isolation has led to increased habitat degradation in some cases, even though the original aim was to protect the environment. [Yoder and Rankin 1998] cite an example from southwestern Ohio in this context, where sanitary sewers were installed in stream beds to reduce water pollution. However, the sewer construction and maintenance that followed resulted only in massive habitat degradation, with some streams permanently damaged.

This report succeeds in identifying the important variables associated with land use, riparian zones, and other covariates that affect the fish, and which must be addressed to guarantee the maintenance of aquatic life uses in streams and rivers over fairly broad areas. The use of unsupervised neural networks significantly broadens the scope of watershed management. Since the results hinge on prevailing environmental conditions, the knowledge obtained can support management and restoration activities more efficiently. The research also provides a means to validate the objectives of current policies and to conceive new goals and trends, which can ultimately be reflected in a more integrated and holistic view of the streams and their watersheds.

5.1. Graphical User Interface

We have designed a Graphical User Interface (GUI) in Matlab that encompasses all the steps of the methodology described in Chapter 3. The layout of the GUI is simple yet comprehensive, keeping in mind the interdisciplinary nature of the audience. Figure 5-1 shows the introductory window of the GUI, which lets the user select either of the two states, Maryland and Ohio, on which the report is based. Once the state has been selected, the user is presented with the window in Figure 5-2, which displays all the modules used in the methodology. We have used basic controls such as push buttons, static text and editable text to guide the user through the modeling. All these elements are provided by the Matlab GUI environment and can be reprogrammed by the user to adapt the interface.

Figure 5-1: Introductory screen prompting the user to select either of the two states used in this research.

Once the state database is loaded and the appropriate number of map units is decided, the SOM is trained by presenting the metric data as the input. The SOM is clustered in the next module, with the Davies-Bouldin index used as the criterion for selecting the number of clusters. This module also performs the necessary data calculations and summarizations to provide the input for the next two modules. After the SOM has been trained and clustered, the user has the option to visualize the spatial distribution of the sampling sites on the state map at either the cluster level or the neuron level. The user can select specific clusters or neurons using the CTRL key, or a range of them using the SHIFT key. The default setting is the distribution of the SOM based on all the clusters. The next module deals with the distribution of the environmental variables, fish metrics, fish IBI and other indices of integrity (related to benthic macroinvertebrates and physical habitat) on the SOM and/or the associated boxplots. We plan to add the spatial distribution of the variables as another option in the near future.
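The role of the Davies-Bouldin index as a selection criterion can be illustrated with a minimal pure-Python sketch. It uses the standard definition of the index (average over clusters of the worst ratio of summed within-cluster scatter to between-centroid distance, lower being better); it is not the Matlab code behind the GUI, and the points and labelings are hypothetical stand-ins for SOM prototype vectors and candidate clusterings. At least two clusters are required.

```python
from math import dist  # Euclidean distance, Python 3.8+

def centroid(points):
    """Component-wise mean of a list of equal-length tuples."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))

def davies_bouldin(points, labels):
    """Davies-Bouldin index of a labeling; lower values are better."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    cents = {l: centroid(ps) for l, ps in clusters.items()}
    # Scatter: mean distance of a cluster's points to its centroid.
    scatter = {l: sum(dist(p, cents[l]) for p in ps) / len(ps)
               for l, ps in clusters.items()}
    worst_ratios = [
        max((scatter[i] + scatter[j]) / dist(cents[i], cents[j])
            for j in clusters if j != i)
        for i in clusters
    ]
    return sum(worst_ratios) / len(clusters)

# Hypothetical prototype vectors and two candidate labelings.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.0)]
db_good = davies_bouldin(points, [0, 0, 1, 1])  # compact, well separated
db_bad = davies_bouldin(points, [0, 1, 0, 1])   # stretched, overlapping
```

In practice the index would be evaluated for each candidate number of clusters, and the clustering with the smallest value retained.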


Figure 5-2: Snapshot of the GUI for the Ohio EPA dataset, indicating the methodology.

Finally, Correspondence Analysis can be performed on the fish data, reduced through the SOM. Canonical Correspondence Analysis provides two sets of information in addition to the biplot of the environmental variables and the SOM neurons. One display box lists the highly correlated variables (ρ > 0.95), indicating that the variables in the second column have been removed. The other box displays the environmental variables sorted in decreasing order of arrow length in the CCA plot. The same information is displayed with respect to the clusters, whereby each cluster is associated with a set of environmental variables using the maximum per-cluster median projection criterion. Finally, the user has the option of saving the figures to a file, resetting the session or exiting the model. The figures are saved in the Portable Network Graphics (PNG) format. We have used a unique combination of the state and the current time to provide a separate path in each session for storing the figures. The reset option clears the database and the associated listboxes highlighting the variables in various parts of the GUI. The user needs to reload the database to move ahead in the session. The exit option closes all the figures and ends the session for the user.

5.2. Future Research

There is still work to be done to realize the final hierarchical model, as proposed in the STAR Watershed Program. As more data become available, the existing methodology will be tested and tuned accordingly. Since the final model calls for prediction of the fish IBI from the root stressors, the SOM needs to be combined with supervised neural algorithms to account for the prediction. Cluster level prediction could be employed to realize specialized networks dealing with specific metric traits.


Urban watershed management, protection and restoration strategies continue to develop as new information is revealed and relationships between instream biological community performance and stressors are better understood and tuned to account for regional imbalances. The Ohio EPA has been at the forefront of the environmental success stories witnessed in the state. A study by [Yoder et al. 2000] focusing on six urban areas in Ohio (Cincinnati, Cleveland/Akron, Columbus, Dayton, Toledo, and Youngstown) emphatically dismissed the notion that a simple stressor such as imperviousness can represent urban landscapes, and instead advocated multiple stressors, comprising variables dealing with habitat degradation and chemical loadings, to represent them effectively. Detection of ecological trends for assessment at geographic scales finer than statewide is one of the many questions that policy makers face while addressing the issue of biological integrity. However, this poses a difficult challenge, as sampling at finer scales requires a substantial sampling effort which is difficult to achieve. Results from this research help us to identify locations with similar biological traits, with an associated overview of the environment. This is beneficial as it allows the manager to pick specific sites within specific basins for monitoring purposes. The modeling framework presented in this technical report provides a tool to guide researchers and managers in demonstrating the degree of local degradation with respect to the surroundings. Well established biological surveys will go a long way in providing the required thrust to analyze nonpoint source management. Identification and inclusion of specific stressors related to land use and riparian conditions, along with biological communities, in the database should be one of the high priority tasks for the state agencies to ensure that the ensuing research spans well developed bioassessment programs, as envisioned in the Clean Water Act.

This research lays the appropriate groundwork to build a layered hierarchical model, as proposed in the STAR Watershed Project. We hope that this research will contribute to establishing ANN technology as an intrinsic tool in problem solving for the environmental engineer.


Bibliography

Anctil, F., and Tape, D.G. "An exploration of artificial neural network rainfall-runoff forecasting combined with wavelet decomposition", Journal of Environmental Engineering and Science, 3(Supplement S1): S121-S128, 2004.
Baxter, C.W., Smith, D.W., and Stanley, S.J. "A Comparison of Artificial Neural Networks and Multiple Regression Methods for the Analysis of Pilot-Scale Data", Journal of Environmental Engineering and Science, 3(Supplement S1): S45-S58, 2004.
Bishop, C.M. Neural Networks for Pattern Recognition. Oxford University Press, New York. 1995.
Blair, R.B. "Land Use and Avian Species Diversity Along an Urban Gradient", Ecological Applications, 6(2): 506-519, 1996.
Blankenship, K. "Findings of the Maryland Biological Stream Survey", Bay Journal, 10(3): 11, 2000.
Bray, J.R., and Curtis, J.T. "An Ordination of the Upland Forest Communities of Southern Wisconsin", Ecological Monographs, 27(4): 325-349, 1957.
Brosse, S., Giraudel, J.L., and Lek, S. "Utilisation of non-supervised neural networks and principal component analysis to study fish assemblages", Ecological Modelling, 146: 159-166, 2001.
Çamdeviren, H., Demir, N., Kanik, A., and Keskin, S. "Use of principal component scores in multiple linear regression models for prediction of Chlorophyll-a in reservoirs", Ecological Modelling, 181(4): 581-589, 2005.
Canoco. Web page. Microcomputer Power. http://www.microcomputerpower.com/catalog/canoco.html. 10/2004.
Carpenter, G.A., and Grossberg, S. "The ART of Adaptive Pattern Recognition by a Self-Organising Neural Network", IEEE Computer, 21: 77-88, 1988.
Comeleo, R.L., Paul, J.F., August, P.V., Copeland, J., Baker, C., Hale, S.S., and Latimer, R.L. "Relationships between watershed stressors and sediment contamination in Chesapeake Bay estuaries", Landscape Ecology, 11: 307-319, 1996.
Danh, N.V., Phien, H.N., and Das Gupta, A. "Neural Network Models for River Flow Forecasting", Water SA, 25(1): 33-39, 1999.
Davies, J.L., and Bouldin, D.W. "A cluster separation measure", IEEE Transactions on Pattern Analysis and Machine Intelligence, 1(4): 224-227, 1979.
Dow, C.L., and Zampella, R.A. "Specific conductance and pH as indicators of watershed disturbance in streams of the New Jersey Pinelands, USA", Environmental Management, 26(4): 437-445, 2000.
ESCAP. Guidelines and Manual on Land-use Planning and Practices in Watershed Management and Disaster Reduction, Economic and Social Commission for Asia and the Pacific (ESCAP), United Nations. 1997.
Fausch, K.D., Hawkes, C.L., and Parsons, M.G. Models that predict standing crop of stream fish from habitat variables: 1950-85. General Technical Report PNW-GTR-213, United States Forest Service, Pacific Northwest Field Station, Portland, Oregon, pp. 52. 1988.
Feck, J., and Hall, R.O. "Response of American Dippers (Cinclus mexicanus) to variation in stream water quality", Freshwater Biology, 49(9): 1123-1137, 2004.
Gevrey, M., Dimopoulos, L., and Lek, S. "Review and comparison of methods to study the contribution of variables in artificial neural network models", Ecological Modelling, 160(3): 249-264, 2003.
Gevrey, M., Rimet, F., Park, Y.S., Giraudel, J.L., Ector, L., and Lek, S. "Water quality assessment using diatom assemblages and advanced modelling techniques", Freshwater Biology, 49(2): 208-220, 2004.
Giraudel, J.L., and Lek, S. "A comparison of self-organizing map algorithm and some conventional statistical methods for ecological community ordination", Ecological Modelling, 146: 329-339, 2001.
Greenacre, M.J., and Vrba, E.S. "A correspondence analysis of biological census data", Ecology, 65: 984-997, 1984.
Guertin, D.P., Miller, S.N., and Goodrich, D.C. Emerging tools and technologies in watershed management, USDA Forest Service Proceedings RMRS-P-13: Proceedings of the Conference on Land Stewardship in the 21st Century: The Contributions of Watershed Management, Tucson, AZ, pp. 194-204. 2000.
Haykin, S. Neural Networks: A Comprehensive Foundation. Prentice Hall, Upper Saddle River, NJ. 1999.
HEC. Web page. Simulation of Flood Control and Conservation Systems. http://www.hec.usace.army.mil/software/legacysoftware/hec5/hec5.htm. 03/2005.
Hijikata, Y., Takeuchi, H., Yoshida, T., and Nishida, S. A Dynamic Linkage Method for Text Data Based on Self-Organizing Map, Proceedings of 6th IEEE International Workshop on Robot and Human Communication, Sendai, Japan, pp. 420-425. 1997.
Hill, M.O. "Correspondence analysis: a neglected multivariate method", Applied Statistics, 23(3): 340-354, 1974.
Jackson, L.J., Trebitz, A.S., and Cottingham, K.L. "An introduction to the practice of ecological modeling", BioScience, 50(8): 694-706, 2000.
Joutsiniemi, S.L., Kaski, S., and Larsen, T.A. "Self-organizing map in recognition of topographic patterns of EEG spectra", IEEE Transactions on Biomedical Engineering, 42(11): 1062-1068, 1995.
Karr, J.R.
"Assessment of biotic integrity using fish communities." Fisheries., 6(6): 21-27, 1981. Karr, J.R. "Biological integrity: a long-neglected aspect of water resource management." Ecological Applications, 1: 66:84, 1991. Kendall, B.E., Briggs, C.J., Murdoch, W.W., Turchin, P., Ellner, S.P., McCauley, E., Nisbet, R.M., and Wood, S.N. "Why do populations cycle? A synthesis of statistical and mechanistic modeling approaches." Ecology, 80: 1789-1805, 1999. Kingston, G.B., Maier, H.R., and Lambert, M.F., 2004. A Statistical Input Pruning Method for Artificial Neural Networks Used in Environmental Modelling, Transactions of the 2nd Biennial Meeting of the International Environmental Modelling and Software Society, Osnabrueck, Germany, pp. 87-92. Kiviluoto, K. "Topology preservation in self-organizing maps", Proceedings of IEEE International Conference on Neural Networks, 1: 294-299, 1996. Kohonen, T. "The self-organizing map", Proceedings of the IEEE, 78(9): 1464-1480., 1990. Kohonen, T. Self-Organizing Maps. Springer, Berlin. 2001.

78

Kohonen, T., Oja, E., Simula, O., Visa, A., and Kangas, J. "Engineering application of the self-organizing map", Proceedings of the IEEE, 84(10): 1358-1384, 1996. KYPIPE. Web page. KY Pipe. http://www.kypipe.com/. 03/2005. Legendre, P., and Gallagher, E. "Ecologically meaningful transformations for ordination of species data", Oecologia, 129: 271-280, 2001. Legendre, P., and Legendre, L. Numerical ecology. Elsevier Science BV, Amsterdam. 1998. Lek, S., Delacoste, M., Baran, P., Dimopoulos, I., Lauga, J., and Aulagnier, S. "Application of neural networks to modelling non linear relationships in ecology." Ecological Modelling, 90(2-3): 39:52, 1996. Lek, S., and Guégan, J.F. Artificial neuronal networks: application to ecology and evolution. Springer-Verlag, Berlin. 2000. Lewin, N., Zhang, Q., Chu, L., and Shariff, R. "Predicting total trihalomethane formation in finished water using artificial neural networks", Journal of Environmental Engineering and Science, 3(Supplement S1): S35-S43, 2004. Luo, X., Singh, C., and Patton, A.D., 2000. Power system reliability evaluation using self organizing map, IEEE Power Engineering Society Winter Meeting. Conference Proceedings, Piscataway, NJ, USA, pp. 1103-1108. Luttrell, S.P. "A Bayesian Analysis of Self-Organizing Maps", Neural Computation, 6: 767-794, 1994. Lye, W. L., Chekima, A., F., Liau C., and Dargham, J.A., 2002. Iris recognition using self-organizing neural network, Student Conference on Research and Development, Shah Alam, Malaysia, pp. 169- 172. Manel, S., Dias, J.M., and Ormerod, S.J. "Comparing discriminant analysis, neural networks and logistic regression for predicting species distributions: a case study with a Himalayan river bird", Ecological Modelling, 120(2-3): 337–347, 1999. Matejicek, L., 2003. Development of software tools for ecological field studies using ArcPad, ESRI 24th Annual International User Conference, San Diego, California, pp. 12. Matlab. Web page. The Mathworks, Inc. 
http://www.mathworks.com. 12/2004. Mercurio, G., Chaillou, J.C., and Roth, N.E. Guide to Using 1995-1997 Maryland Biological Stream Survey Data. CWBP-MANTA-EA-99-5, Versar, Inc. for Maryland Department of Natural Resources, Monitoring and Non-Tidal Assessment Division, Columbia, MD. 1999. Momcilo, M., Tsai, C.W.S., and Demissie, M. "Uncertainty of weekly nitrate-nitrogen forecasts using artificial neural networks", Journal of Environmental Engineering, 129(3): 267-274, 2003. Novotny, V. Water Quality: Diffuse Pollution and Watershed Management. John Wiley and Sons, New York, NY. 2003. Novotny, V. Linking Diffuse Pollution to Water Body Integrity. Technical Report #1, Center for Urban Environmental Studies, Northeastern University, Boston, MA. 2004. Novotny, V., Bartosova, A., O'Reilly, N., and Ehlinger, T. "Unlocking the relationship of biotic integrity of impaired waters to anthropogenic stresses", Water Research, 39: 184-198, 2005. Novotny, V., Smith, D.W., and Kuemmel, D.A. "Management and control of diffuse urban snowmelt pollution", Civil Engineering Practice, 18(2): 17-32, 2003.

79

Oberdorff, T., Pont, D., Hugueny, B., and Chessel, D. "A probabilistic model characterizing fish assemblages of French rivers: a framework for environmental assessment", Freshwater Biology, 46: 399-415, 2001. OhioEPA. Biological Criteria for the Protection of Aquatic Life: Volumes I – III, Ohio Environmental Protection Agency, Columbus, Ohio. 1987. OhioEPA. The State of the Aquatic Ecosystem: Ohio Rivers & Streams: Causes and Sources of Impairment, Factsheet. FS-10-MAS-98. 1998. OhioEPA. Association between Nutrients, Habitat, and the Aquatic Biota in Ohio Rivers and Streams. Technical Bulletin MAS/1999-1-1., Ohio EPA, pp. 70. 1999. Olden, J.D., and Jackson, D.A. "Fish-habitat relationships in lakes: Gaining predictive and explanatory insight using artificial neural networks", Transactions of the American Fisheries Society, 130: 878-897, 2001. Olden, J.D., and Jackson, D.A. "A comparison of statistical approaches for modeling fish species distributions", Freshwater Biology, 47: 1976-1995, 2002. Olden, J.D., Joy, M.K., and Death, R.G. "An accurate comparison of methods for quantifying variable importance in artificial neural networks using simulated data", Ecological Modelling, 178(3-4): 389-397, 2004. Omernik, J.M. "Ecoregions of the conterminous United States (map supplement)", Annals of the Association of American Geographers, 77: 118-125, 1987. Ozesmi, S. L., and Ozesmi, U. "An artificial neural network approach to spatial habitat modelling with interspecific inter-action." Ecological Modelling, 116: 15-31, 1999. Paeqann. Web page. Predicting Aquatic Ecosystem Quality using Artificial Neural Networks. http://aquaeco.ups-tlse.fr/. 10/2004. Palmer, M.W. "Putting things in even better order: the advantages of canonical correspondence analysis", Ecology, 74: 2215-2230, 1993. Park, Y.S., Céréghino, R., Compin, A., and Lek, S. "Applications of artificial neural networks for patterning and predicting aquatic insect species richness in running waters." 
Ecological Modelling, 160(3): 265-280, 2003. Park, Y.S., Chon , T.S., Kwak, I.S., and Lek, S. "Hierarchical community classification and assessment of aquatic ecosystems using artificial neural networks." Science of the Total Environment, 327: 105-122, 2004. Park, Y.S., Verdonschot, P.F.M., Chon, T.S. , and Lek, S. "Patterning and predicting aquatic macroinvertebrate diversities using artificial neural network", Water Research, 37: 1749-1758, 2003. Paruelo, J.M., and Tomasel, F. "Prediction of functional characteristics of ecosystems: a comparison of artificial neural networks and regression models", Ecological Modelling, 98: 173–186, 1997. Pijanowski, B., Brown, D., Shellito, B., and Manik, G. "Using neural networks and GIS to forecast land use changes: A land transformation model", Computers, Environment and Urban Systems, 26(6): 553-575, 2002. Pijanowski, B.C., Gage, S.H., Long, D.E., and Cooper, W.E. 2000. A land transformation model for the Saginaw Bay Watershed. In: J. Sanderson and L. Harris (Editors), Landscape Ecology: A Top Down Approach. CRC Press., Boca Raton, FL, pp. 183-198. Rankin, E.T. The qualitative habitat evaluation index (QHEI): rationale, methods, and application, Ohio Environmental Protection Agency, Columbus, Ohio. 1989.

80

RAS. Web page. Hydrologic Engineering Centers River Analysis System. http://www.hec.usace.army.mil/software/hec-ras/hecras-hecras.html. 03/2005. Reash, R.J., and Pigg, J. "Physicochemical Factors Affecting the Abundance and Species Richness of Fishes in the Cimarron River", Proceedings of the Oklahoma Academy of Science, 70: 23-28, 1990. Recknagel, F. "Applications of machine learning to ecological modelling", Ecological Modelling, 146: 303-310, 2001. Rogers, C.E., Brabander, D.J., Barbour, M.T., and Hemond, H.F. "Use of physical, chemical, and biological indices to assess impacts of contaminants and physical habitat alteration in urban streams", Environmental Toxicology and Chemistry, 21(6): 1156–1167, 2002. Roth, N.E., Southerland, M.T., Chaillou, J.C., Kazyak, P.F., and Stranko, S.A. Refinement and Validation of a Fish Index of Biotic Integrity for Maryland Streams. CBWP-MANTA-EA-00-2, Versar, Inc., Columbia, MD, with Maryland Department of Natural Resources, Monitoring and Non-Tidal Assessment Division. 2000. Roth, N.E., Southerland, M.T., Mercurio, G., Chaillou, J.C., Heimbuch, D.G., and Seibel, J.C. State of the Streams: 1995-1997 Maryland Biological Stream Survey results. CBWP-MANTA- EA-99-6., Versar, Inc., Columbia, MD, and Post, Buckley, Schuh, and Jernigan, Inc., Bowie, MD, for Maryland Department of Natural Resources, Monitoring and Non-Tidal Assessment Division. 1999. Rousseau, A.N., Mailhot, A., Turcotte, R., Duchemin, M., Blanchette, C., M., Roux, Etong, N., Dupont, J., and Villeneuve, J.P. "GIBSI: An integrated modelling system prototype for river basin management", Hydrobiologia, 4223: 465-475, 2000. Roussinov, D., and Chen, H. "A Scalable Self-organizing Map Algorithm for Textual Classification: A Neural Network Approach to Thesaurus Generation", Communication Cognition and Artificial Intelligence, 15(1-2): 81-112, 1998. Scardi, M., Lek, S., Lim, P., Di Dato, P., and Oberdorff, T., 2000. 
Artificial neural networks as a tool for predicting fish community composition in rivers, The Third World Congress of Nonlinear Analysts (WCNA-2000), Catania, Italy. Simonovic, S.P. "Tools for Water Management One View of the Future", IWRA Water International, 25: 76-88., 2000. SPSS. Web page. SPSS, Inc. http://www.spss.com/. 12/2004. SSARR. Web page. Streamflow Synthesis and Reservoir Regulation Model. http://www.nwd-wc.usace.army.mil/report/ssarr.htm. 03/2005. STAR. Web page. Science To Achieve Results. http://www.coe.neu.edu/environment/star.htm. 02/2005. SUTRA. Web page. The Scientific Software Group. http://www.ssgint.com/sutra_overview/sutra_overview.html. 03/2005. Takamura, N., Kadono, Y., Fukushima, M., Nakagawa, M., and Kim, B.H. "Effects of aquatic macrophytes on water quality and phytoplankton communities in shallow lakes", Ecological Research, 18(4): 381-395, 2003. Ter Braak, C. J. F. "Canonical correspondence analysis: a new eigenvector method for multivariate direct gradient analysis", Ecology, 67: 1167-1179., 1986. Ter Braak, C. J. F. "The analysis of vegetation-environment relationships by canonical correspondence analysis", Vegetatio, 69: 69-77, 1987. Ter Braak, C. J. F. "Canonical community ordination Part I: Basic theory and linear methods", Écoscience, 1: 127-140., 1994. 81

Ter Braak, C. J. F., and Verdonschot, P. F. M. "Canonical correspondence analysis and related multivariate methods in aquatic ecology", Aquatic Sciences, 57: 255-289, 1995. Tewari, A., Porter, C., Peabody, J., Crawford, Ed., Demers, R., Johnson, C.C., Wei, J.T., Divine, G.W., O'Donnell, C., Gamito, E., and Menon, M. "Predictive modeling techniques in prostate cancer", Molecular Urology, 5: 147-152, 2001. Thompson, P.A. "Correspondence Analysis in Statistical Package Programs", The American Statistician, 49(3): 310-316, 1995. Tian, D.Q., Sorooshian, S., and Myers, D.E. "Correspondence analysis with Matlab", Computers and Geosciences, 19: 1007-1022, 1993. Ultsch, A., and Siemon, H. P., 1990. Kohonen's self organizing feature maps for exploratory data analysis, Proceedings of ICNN'90, International Neural Network Conference, Dordrecht, Netherlands, pp. 305-308. USEPA. The Quality of our nation's waters: a summary of the national water quality inventory: 1998 report to Congress. Pub. No. 841-S-00-001, Office of Water., Washington, DC, USA. 2000. Vesanto, J. "SOM-based data visualization methods", Intelligent Data Analysis, 3: 111-126, 1999. Vesanto, J., and Alhoniemi, E. "Clustering of the self-organizing map", IEEE Transactions on Neural Networks, 11: 586-600, 2000. Wiley, D.J., Morgan, R.P., Hilderbrand, R.H., Raesly, R.L., and Shumway, D.L. "Relations between Physical Habitat and American Eel Abundance in Five River Basins in Maryland", Transactions of the American Fisheries Society, 133(3): 515-526, 2004. Yabe, K., and Nakamura, T. "Base mineral inflow in a remnant cool-temperate mire ecosystem", Ecological Research, 17: 601-613, 2002. Yoder, C.O., Miltner, R.J., and White, D., 2000. Using biological criteria to assess and classify urban streams and develop improved landscape indicators. In: S. Minamyer, J. Dye and S. Wilson (Editors), National Conference on Tools for Urban Water Resources Management and Protection., U. S. 
Environmental Protection Agency, Cincinnati, Ohio. EPA/625/R-00/001., pp. 32-44. Yoder, C.O., and Rankin, E.T. "The role of biological indicators in a state water quality management process", Environmental Monitoring and Assessment, 51(1-2): 61-88, 1998. Zhang, X., Ball, G., and Halper, E., 2000. Application of Remote Sensing and Geographic Information Systems to Ecosystem-Based Urban Natural Resource Management, USDA Forest Service Proceedings RMRS-P-13: Proceedings of the Conference on Land Stewardship in the 21st Century: The Contributions of Watershed Management, Tucson, AZ, pp. 409-413.

82

Appendices

Appendix A: Maryland Biological Stream Survey

Table A1: Database information

Variable    Type  Label

Site Information
IDX         Num   Index
SITE        Char  Site Identification
ST_NAME     Char  Stream Name
YEAR        Num   Year Sampled
ECOREGION   Char  Ecoregions
REGION      Char  Geographic Region
PHYSIO      Char  Physiographic Province
COUNTY      Char  County
BASIN       Char  Basin
SEGMENT     Num   Sample Segment
ORDER       Num   Strahler Order
SAMP_SPR    Char  Spring Sampleability
DATE_SPR    Num   Actual Date Sampled - Spring
SAMP_SUM    Char  Summer Sampleability
DATE_SUM    Num   Actual Date Sampled - Summer
LAT         Num   Latitude
LONG        Num   Longitude
NORTHING    Num   MD Plane Coordinate
EASTING     Num   MD Plane Coordinate
SHEDCODE    Num   Maryland 8-digit Watershed Code
SHEDNAME    Char  Maryland Watershed Name

Water Chemistry
TEMP_FLD    Num   Water Temperature (°C)
DO_FLD      Num   Dissolved Oxygen (mg/l)
PH_LAB      Num   Lab pH
PH_FLD      Num   In-situ pH
COND_LAB    Num   Lab Conductance (µmho/cm)
COND_FLD    Num   In-situ Conductance (µmho/cm)
ANC_LAB     Num   Acid Neutralizing Capacity (µeq/l)
DOC_LAB     Num   Dissolved Organic Carbon (mg/l)
NO3_LAB     Num   Nitrate Nitrogen (mg/l)
SO4_LAB     Num   Sulfate (mg/l)
ACIDSRC     Char  Source of Acidity

Physical Habitat
PASTURE     Char  Pasture
CHANNEL     Char  Channelized
CONCRETE    Char  Concrete/Gabion
STORMDRN    Char  Storm Drain
EFF_DIS     Char  Effluent Discharge
BEAVPOND    Char  Beaver Pond
INSTRHAB    Num   Instream Habitat Structure
EPI_SUB     Num   Epifaunal Substrate
VEL_DPTH    Num   Velocity/Depth Diversity
POOLQUAL    Num   Pool/Glide/Eddy Quality
RIFFQUAL    Num   Riffle/Run Quality
CHAN_ALT    Num   Channel Alteration
BANKSTAB    Num   Bank Stability
EMBEDDED    Num   Embeddedness
CH_FLOW     Num   Channel Flow Status
SHADING     Num   Shading
REMOTE      Num   Remoteness
AESTHET     Num   Aesthetic Rating
WOOD_DEB    Num   Number of Woody Debris
NUMROOT     Num   Number of Rootwads
RIP_WID     Num   Riparian Buffer Width (m)
BUFF_TYP    Char  Riparian Buffer Type
ADJ_COVR    Char  Adjacent Land Cover Type
MAXDEPTH    Num   Maximum Depth (cm)
ST_GRAD     Num   Stream Gradient (%)
AVGWID      Num   Average Wetted Width (m)
AVGTHAL     Num   Average Thalweg Depth (cm)
AVG_VEL     Num   Average Velocity (m/s)
FLOW        Num   Streamflow (cfs)

Land Use
ACREAGE     Num   Catchment Area (acres)
URBAN       Num   Urban Land Use (%)
AGRI        Num   Agricultural Land Use (%)
FOREST      Num   Forest Land Use (%)
WETLANDS    Num   Wetland Land Use (%)
BARREN      Num   Barren Land Use (%)
WATER       Num   Water Land Use (%)
HIGHURB     Num   High Intensity Urban Land Use (%)
LOWURB      Num   Low Intensity Urban Land Use (%)
PASTUR      Num   Hay/pasture/grass Land Use (%)
PROBCROP    Num   Probable Row Crop Land Use (%)
ROWCROP     Num   Row Crop Land Use (%)
CONIFER     Num   Conifer (Evergreen) Forest Land Use (%)
DECIDFOR    Num   Deciduous Forest Land Use (%)
MIXEDFOR    Num   Mixed Forest Land Use (%)
EMERGWET    Num   Emergent Wetlands Land Use (%)
WOODYWET    Num   Woody Wetland Land Use (%)
COALMINE    Num   Coal Mine (%)
TRANS       Num   Transitional Land Use (%)

Indicators
PHI         Num   Physical Habitat Index
BKTRFLAG    Num   Brook Trout Abundance
BLACKWAT    Num   Blackwater Stream
STRATA_R    Char  Fish IBI Stratum
FIBI_98     Num   Fish Index of Biotic Integrity
BIBI_98     Num   Benthic Index of Biotic Integrity
HILSNHOF    Num   Hilsenhoff Index of Biotic Integrity
EPT_TAXA    Num   Number of EPT Taxa

Fish Metrics
NUMNATIVE   Num   Number of Native Species
NUMBENTHIC  Num   Number of Benthic Fish Species
NUMINTOL    Num   Number of Intolerant Species
PCTOL       Num   Percentage of Tolerant Fish
PCDOM       Num   Percentage of Dominant Species
PCGOI       Num   Percentage of Generalists, Omnivores, and Invertivores
PCINSECT    Num   Percent Insectivores
NUMINDVSQM  Num   Number of Individuals per Square Meter
BIOPSQM     Num   Biomass (g) per Square Meter
PCSPAWN     Num   Percent Lithophilic Spawners

Table A2: Fish Metrics and IBI formulation

Metrics and scoring criteria for the recommended final fish IBI.

Coastal Plain
Metric                                            Score 5    Score 3            Score 1
Number of Native Species                          Criteria vary with stream size (see below)
Number of Benthic Species                         Criteria vary with stream size (see below)
Number of Intolerant Species                      Criteria vary with stream size (see below)
Percent tolerant fish                             ≤ 50       50 < x ≤ 93        > 93
Percent abundance of dominant species             ≤ 33       33 < x ≤ 78        > 78
Percent generalists, omnivores, and invertivores  ≤ 92       92 < x < 100       100
Number of individuals per square meter                                          < 0.42
Biomass (g) per square meter                                                    < 3.6

Eastern Piedmont
Metric                                            Score 5    Score 3            Score 1
Number of Native Species                          Criteria vary with stream size (see below)
Number of Benthic Species                         Criteria vary with stream size (see below)
Number of Intolerant Species                      Criteria vary with stream size (see below)
Percent tolerant fish                             ≤ 41       41 < x ≤ 65        > 65
Percent abundance of dominant species             ≤ 30       30 < x ≤ 52        > 52
Percent generalists, omnivores, and invertivores  ≤ 86       86 < x ≤ 99.7      > 99.7
Number of individuals per square meter            ≥ 0.81     0.35 ≤ x < 0.81    < 0.35
Biomass (g) per square meter                      ≥ 8.0      3.7 ≤ x < 8.0      < 3.7
Percent lithophilic spawners                      ≥ 62       22 ≤ x < 62        < 22
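
The Eastern Piedmont thresholds above can be read as a simple lookup from a metric value to a score of 5, 3, or 1. The Python sketch below is illustrative only and is not taken from the report: the dictionary EASTERN_PIEDMONT and the function score_metric are hypothetical names, and the metric keys borrow the variable names listed in Table A1. The three species-count metrics are left out because their criteria vary with stream size.

```python
# Hypothetical sketch: score Eastern Piedmont fish metrics against Table A2.
# Each entry is (score-5 bound, score-3 bound, higher_is_better).
# Threshold values are copied from Table A2; all names are assumptions.
EASTERN_PIEDMONT = {
    "PCTOL": (41.0, 65.0, False),      # percent tolerant fish
    "PCDOM": (30.0, 52.0, False),      # percent abundance of dominant species
    "PCGOI": (86.0, 99.7, False),      # percent generalists, omnivores, invertivores
    "NUMINDVSQM": (0.81, 0.35, True),  # individuals per square meter
    "BIOPSQM": (8.0, 3.7, True),       # biomass (g) per square meter
    "PCSPAWN": (62.0, 22.0, True),     # percent lithophilic spawners
}


def score_metric(name, value, criteria=EASTERN_PIEDMONT):
    """Return 5, 3, or 1 for one metric value using the tabulated bounds."""
    best, fair, higher_is_better = criteria[name]
    if higher_is_better:
        # Metrics where larger values indicate better condition (>= bounds).
        if value >= best:
            return 5
        if value >= fair:
            return 3
        return 1
    # Metrics where smaller values indicate better condition (<= bounds).
    if value <= best:
        return 5
    if value <= fair:
        return 3
    return 1
```

Under these assumptions, a site with 45% tolerant fish would score 3 on that metric. How the individual metric scores are combined into the final fish IBI is not covered by this sketch.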

Highland
Metric                                            Score 5    Score 3            Score 1
Number of Benthic Species                         Criteria vary with stream size (see below)
Number of Intolerant Species                      Criteria vary with stream size (see below)
Percent tolerant fish                             ≤ 28       28 < x ≤ 71        > 71
Percent abundance of dominant species             ≤ 49       49 < x ≤ 91        > 91
Percent generalists, omnivores, and invertivores  ≤ 49       49 < x ≤ 92        > 92
Percent insectivores                              ≥ 48       8 ≤ x < 48
Percent lithophilic spawners                      ≥ 70       42 ≤ x < 70

38 >3 >5 >3