Journal of Applied Statistics, Vol. 28, No. 7, 2001, 843-853
Using PCA scores to classify species communities: an example for pelagic seabird distribution
F. HUETTMANN¹,² & A. W. DIAMOND¹, ¹Atlantic Cooperative Wildlife Ecology Research Network (ACWERN), University of New Brunswick, Canada and ²Centre for Wildlife Ecology, Simon Fraser University, Canada
Abstract Using Principal Component Analysis (PCA) in order to classify animal communities from transect counts is a widely used method. One problem with this approach is determining an appropriate cut-off point on the Principal Component (PC) axis to separate communities. We have developed a method using the distribution of PC scores of individual species along transects from the PIROP (Programme Intégré de Recherches sur les Oiseaux Pélagiques) database for seabirds at sea in the Northwest Atlantic in winter 1965-1992. This method can be applied generally to wildlife species, and also facilitates the evaluation, justification and stratification of PCs and community classifications in a transparent way. A typical application of this method is shown for three Principal Components; spatial implications of the cut-off decision for PCs are also discussed, e.g. for habitat studies.

Correspondence: F. Huettmann, Centre for Wildlife Ecology, Department of Biological Sciences, 8888 University Drive, Simon Fraser University, Burnaby/Vancouver BC, Canada V5A 1S6. E-mail: [email protected].
ISSN 0266-4763 print; 1360-0532 online/01/070843-11 DOI: 10.1080/02664760120074933
© 2001 Taylor & Francis Ltd

1 Introduction
The characterization of animal communities from standardized survey counts is not trivial since it comprises a classification of biological data. This requires drawing true borderlines between the occurrence (e.g. presence, absence) of different animal species in a counting unit (e.g. at a specified location and time), and involves 'distinguishing noise from the inherent signal' (Jackson, 1993). This problem is central to research involving the definition of communities and classes (Wiens, 1989; Rita & Peuhkuri, 1996), particularly when the counts and communities are visualized in space, e.g. using a Geographic Information System (GIS)
(Brown et al., 1996). The various approaches that have been used include neural networks (Barndorff-Nielsen et al., 1993; Scardi, 1996; Huettmann, 2000a), artificial intelligence (Huettmann, 1996), classification and regression trees (Miller, 1994; Venables & Ripley, 1994; Miller & Ribic, 1995; Bell, 1996; O'Connor et al., 1996; O'Connor & Jones, 1997; Huettmann & Lock, 1997), TWINSPAN (van Tongeren, 1995; Kirk et al., 1996), Cluster Analysis (Poulin et al., 1994), Recurrent Group Analysis (Alexander & McLaughlin, 1997) and Principal Component Analysis (PCA) (Skov et al., 1995; Brown et al., 1996; Ballance et al., 1997).

We are using data from the PIROP (Programme Intégré de Recherches sur les Oiseaux Pélagiques) database (Brown et al., 1975; Brown, 1986; Diamond et al., 1993; Lock et al., 1994; Huettmann & Lock, 1997; Huettmann & Diamond, 2000, in press) of seabirds at sea off eastern Atlantic Canada to investigate their distribution patterns; here we explore the use of PCA to determine species assemblages. Finding the appropriate methodology is particularly important since species assemblages of seabirds for the Northwest Atlantic have not yet been described (Powers & Brown, 1987; see De Graaf et al., 1985 for foraging guilds; Ainley et al., 1995; Ribic & Ainley, 1997; Ballance et al., 1997; Mills, 1998 for West American waters).

In a recent study of seabird distribution patterns in the Pacific Ocean, Ballance et al. (1997) used PCA to characterize seabird communities. Based on transect counts they identified the seabird communities by determining cut-off points along the abscissa of a histogram of Principal Component (PC) scores (see Figures 1, 3, 5). Their choice of cut-off points, or 'natural breaks', to determine community boundaries was essentially subjective, rather than biologically based, transparent or quantitative (Ballance et al., 1997; Ballance in litt. 23 October 1997).

We build on this approach by proposing a more transparent and objective method that considers the summarized occurrence of species in defined counting units with the same PC scores, as identified from the PCA analysis, along the PC score distribution (see Figures 2, 4, 6). We include three examples to demonstrate the method and compare it with that used by Ballance et al. (1997).
2 Methods
We used standardized 10-min counts ('watches') of birds at sea off the coasts of eastern and arctic Canada from the PIROP database (Lock et al., 1994). Here, we use the winter counts (November-February) from a 26-year time period, 1965-1992, and present the seabird assemblages for counts excluding the presence of vessels since ships might be expected to reflect different assemblages of species (e.g. Nettleship et al., 1984; Camphuysen et al., 1995; Garthe & Hueppop, 1996). The full PIROP database was queried within these winter months for vessel presence and bird behaviour associated with vessels, using only standard watch conditions (10 minutes long, ship speed of at least 10 knots, good weather, 180 degrees frontal count, exact species identification; see Huettmann & Lock, 1997). This left 2889 'high quality' watches from the original 13 761 watches in the PIROP database for the specified time period. Using these high quality watches ensures that representative conclusions can be drawn. The queried raw counts were log transformed (log10(x + 1)), using Visual FoxPro 5 (Siegel, 1994) and SPLUS, version 4.1 (Venables & Ripley, 1994; StatSci Division, 1995, 1996). The PCA analysis was done in SPLUS and the PCs were identified using the Kaiser criterion as a stopping rule (90% threshold of cumulative explained variance, Jackson, 1993;
Table 1. Amount of variance accounted for by Principal Components of seabird counts during winter 1966-1992. Using the Kaiser criterion, the first four PCs (components 1-4) are selected for further analysis

Component No.                        1      2      3      4      5      6      7      8      9      10
Standard deviation                   0.541  0.487  0.365  0.24   0.169  0.119  0.1    0.077  0.1    0.02
Proportion of variance               0.372  0.302  0.17   0.1    0.04   0.02   0.01   0      0      0
Cumulative proportion of variance    0.372  0.675  0.846  0.92   0.958  0.976  0.988  1      1      1
Eigenvalues                          0.293  0.238  0.133  0.1    0.03   0.01   0      0      0      0
Mean of the vector of eigenvalues    0.08   0.08   0.08   0.08   0.08   0.08   0.08   0.08   0.08   0.08
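The rows of Table 1 are related to one another in a simple way, which the following minimal Python sketch illustrates. It assumes that each eigenvalue is the square of the component's standard deviation and that the stopping rule is the 90% cumulative-variance threshold named in the Methods. The function names and the standard deviations used here are hypothetical and are not the values of Table 1.

```python
from itertools import accumulate


def variance_table(std_devs):
    """Return eigenvalues, proportions, cumulative proportions and the mean eigenvalue."""
    eigenvalues = [sd ** 2 for sd in std_devs]        # variance carried by each component
    total = sum(eigenvalues)
    proportions = [ev / total for ev in eigenvalues]  # share of the total variance
    cumulative = list(accumulate(proportions))        # running sum of the shares
    mean_eigenvalue = total / len(eigenvalues)
    return eigenvalues, proportions, cumulative, mean_eigenvalue


def components_to_keep(cumulative, threshold=0.90):
    """Smallest number of leading components whose cumulative share reaches the threshold."""
    for n, value in enumerate(cumulative, start=1):
        if value >= threshold:
            return n
    return len(cumulative)


# Hypothetical standard deviations (assumption for illustration only).
example_sd = [2.0, 1.0, 1.0, 0.5, 0.5]
eigenvalues, proportions, cumulative, mean_eigenvalue = variance_table(example_sd)
n_keep = components_to_keep(cumulative)
print(n_keep, round(mean_eigenvalue, 2))
```

The threshold is kept as a parameter so that a different cumulative share can be tried without changing the rest of the sketch.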
Table 2. Loadings of Principal Components 1-4, winter (loadings < 0.1 are not shown).
Seabird Species (rows): Northern Fulmar (Fulmarus glacialis); Black-legged Kittiwake (Rissa tridactyla); Herring Gull (Larus argentatus); Iceland Gull (L. glaucoides); Glaucous Gull (L. hyperboreus); Great Black-backed Gull (L. marinus); Dovekie (Alle alle); Thick-billed Murre (Uria lomvia). Final row: Interpretation.
Columns: PC1, PC2, PC3, PC4
-0.558
-0.321
0.756
0.109
-0.639
-0.368
-0.648
0.159 -0.35
-0.182
0.493
-0.916 -0.868
PC1: Strong dominance of Dovekie. Graph included as an example for the difference between 'Ballance' cut-off and our method (Figs 1, 2 and 7).
PC2: Absence of Dovekie, Northern Fulmar or Black-legged Kittiwake. Graph included as an example for the difference between 'Ballance' cut-off and our method (Figs 3 and 4).
PC3: Northern Fulmar but no Black-legged Kittiwake. Graph included as an example for an overlapping 'grey zone' (Figs 5 and 6).
PC4: Absence of large gulls (Laridae), but presence of Northern Fulmar and Black-legged Kittiwake.
see Table 1). This resulted in 4 PCs for watches without the impact of vessels. We then defined the seabird species assemblages of each PC using the values of the loadings (see Table 2). The PC scores were then attached to each individual watch, using a relational database with a linked key in Visual FoxPro. As shown below, this allows us to evaluate how well each individual watch matches the specific PC. Using SQL (Structured Query Language) database queries, these new files were then 'grouped by' PC score values and by seabird species so that the overall counts for each seabird species were summarized for each PC score. The seabird species abundances of the identified assemblages were then plotted along the gradient of PC scores. PC1, PC2 and PC3 (Figures 2, 4, 6) show typical results of our approach. Instead of using the shape of the overall distribution of PC scores to determine the suggested limits of a seabird community (Ballance et al., 1997; hereafter called the 'Ballance' cut-off), we used the PC scores of separate species, as identified in the PC, to determine the cut-off point that identifies the seabird assemblage (hereafter called 'our cut-off').

3 Results and discussion
The following interpretation of PC loadings and determination of PCs was done taking the PCs into an overall context. We identified and characterized four PCs (Table 1, Table 2). The PC score histograms of the individual species clustered together by the PCA suggest more clearly which score value represents a cut-off point for a PC community than the distribution of PC scores alone (Ballance et al., 1997). These cut-off values, or thresholds, can be used to stratify the data set.

Figure 1 shows the distribution of scores of PC1, identified from Table 2 as a 'community' and characterized by a strong dominance of Dovekies (Alle alle). The first obvious discontinuity in scores (the 'Ballance' cut-off) is at 0.4 (Fig. 1), but this does not correspond to an obvious discontinuity in Fig. 2 for PC scores of Dovekie, in which Dovekies are present in most counts with a PC1 score of -0.9 or higher. A similar situation can be found for PC2, where the 'Ballance' cut-off in Fig. 3 lies between -0.8 and -0.7, which does not match our cut-off determined in Fig. 4.
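The 'group by' summary described in the Methods, and the reading of a species-based cut-off from it, can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the SQL queries used for the study: the watch records, the rounding of scores to one decimal and the rule 'lowest score at which the species was recorded' are all hypothetical stand-ins for cut-offs that were read from the plots.

```python
from collections import defaultdict


def birds_per_score(watches, species, ndigits=1):
    """Sum the counts of one species over all watches sharing a (rounded) PC score."""
    totals = defaultdict(int)
    for watch in watches:
        score = round(watch["score"], ndigits)
        totals[score] += watch["counts"].get(species, 0)
    return dict(sorted(totals.items()))


def lowest_score_with_presence(totals):
    """Lowest PC score at which the species was recorded, or None if never recorded."""
    present = [score for score, n in totals.items() if n > 0]
    return min(present) if present else None


# Hypothetical watches: one PC score per watch plus raw counts per species.
watches = [
    {"score": -1.31, "counts": {"Dovekie": 0}},
    {"score": -0.92, "counts": {"Dovekie": 3}},
    {"score": -0.88, "counts": {"Dovekie": 5}},
    {"score": 0.12, "counts": {}},
    {"score": 0.41, "counts": {"Dovekie": 40}},
    {"score": 0.43, "counts": {"Dovekie": 12, "Northern Fulmar": 1}},
]

dovekie_totals = birds_per_score(watches, "Dovekie")
cutoff = lowest_score_with_presence(dovekie_totals)
print(dovekie_totals, cutoff)
```

Plotting the summed counts against the score gives a histogram of the kind shown in Fig. 2, whereas counting watches per score gives the kind shown in Fig. 1.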
Fig. 1. No. of watches per score for PC1.
Fig. 2. No. of birds (Dovekie) per score for PC1.
Fig. 3. No. of watches per score for PC2.
Fig. 4. No. of birds (Dovekie, Black-legged Kittiwake, Northern Fulmar) per score for PC2.
Table 2 shows that the community represented by PC2 is characterized by the absence of three species: Northern Fulmar (Fulmarus glacialis), Black-legged Kittiwake (Rissa tridactyla) and Dovekie. Figure 4 shows that all three species are present at PC2 scores of +0.5 and below, which becomes our cut-off score for this community. Setting a PC score cut-off at the far end of the PC score spectrum, as seen here, can be necessary if a pure presence/absence pattern is required; but by doing so sample sizes can get small. Such cut-off locations can become necessary when PCs are not clear-cut, have a higher degree of bias, overlap, or when some animals are only represented in small numbers in the data set. However, this is not a problem for the PC score method as such; the exact implications depend on the specific application, e.g. on whether sample size matters for a stratum determined by the thresholds.

The third Principal Component is characterized by the presence of Northern Fulmar and the absence of Black-legged Kittiwake (Table 2). It is difficult to assign a 'Ballance' cut-off from Fig. 5; PC scores of +0.1, -0.1 or -0.4 seem equally appropriate. However, Fig. 6 clearly shows that Black-legged Kittiwakes are absent, and Northern Fulmars present, at PC3 scores of +0.79 and above. Our method shows that species-abundance histograms can also indicate the existence of overlapping, or 'grey zones'; a situation that occurs often in biological classification processes, where no community is clearly defined and species occurrences overlap (Fig. 6). Our method allows identification of such areas along a PC score gradient, and it facilitates a justification of potential cut-offs within them. Since species are directly linked with 10-minute watches (or any other 'bin' used in transect studies), watches with ambiguous species composition can be excluded, or else extracted for further analysis by using their PC score values as a threshold or identification.

Fig. 5. No. of watches per score for PC3.
Fig. 6. No. of birds (Black-legged Kittiwake, Northern Fulmar) per score for PC3.

An erratic distribution of PC scores with gaps in between, of which a typical case is shown in Fig. 2, may indicate an irruptive species appearing irregularly in the study area, as is the case with Dovekies and 'wrecks' (a larger flock of seabirds found exhausted, or dead, on the shore, caused by severe weather events; Stenhouse & Montevecchi, 1996; Gaston & Jones, 1998). Gaps within the PC score distribution could also suggest possible methodological errors during the sampling procedure for these watches, e.g. the species was overlooked during transect surveys, since it is present at higher and lower PC scores. For spatial and Geographic Information Systems (GIS) habitat applications, these gaps within the PC score distribution may also indicate that the locations of watches with this particular PC score identify a potential habitat where a species can be expected to occur (Fig. 2); such locations should therefore not easily be excluded as an 'absence habitat' (compare also Huettmann 2000b). An example of such a situation is shown in Fig. 7, emphasizing the spatial impact of the difference between the 'Ballance' cut-off and our cut-off value. Watches that would be explained by using the 'Ballance' cut-off are all included in the area encompassed by the distribution of watches as identified by our cut-off. This suggests that our cut-off gives a more complete picture of the pelagic Dovekie distribution in winter (compare also Stenhouse & Montevecchi, 1996; Gaston & Jones, 1998).

Fig. 7. Spatial impact for PC1 (Dovekies) when 'Ballance' cut-off and our cut-off are used.

PC frequency histograms of the species also help to evaluate how well the species compositions of the communities of the separate PCs were determined earlier from the PC loadings (Fig. 4). We find that our method proves particularly useful when loadings are low, not strongly different or overlapping. In addition, visualizing the PC scores on a map helps to evaluate where the typical communities can be found in space (Fig. 7, Brown et al., 1996; O'Connor, personal communication).

Therefore, we conclude that our suggested interpretation of PCA scores for determining cut-off thresholds for species within species assemblages derived from transect counts is more objective and transparent than previous methods. Our approach is of general use for wildlife biologists to describe, evaluate and to stratify species communities. Using this method in a spatial context, e.g. for GIS habitat studies, awaits further, in-depth applications.
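The spatial comparison of the two cut-offs for PC1, and the search for gaps in a species' score distribution, can be sketched as follows. This is a hedged illustration only: the watch records, positions and summed counts are hypothetical, the cut-off values 0.4 and -0.9 are the PC1 values quoted in the text, and selecting watches 'at or above' the cut-off is an assumption about the direction of the PC1 axis.

```python
def select_watches(watches, cutoff):
    """Watches whose PC score is at or above the cut-off."""
    return [w for w in watches if w["score"] >= cutoff]


def gap_scores(totals):
    """Scores without birds that lie between scores where the species occurs."""
    present = [s for s, n in totals.items() if n > 0]
    if not present:
        return []
    low, high = min(present), max(present)
    return [s for s in sorted(totals) if totals[s] == 0 and low < s < high]


# Hypothetical watches with a PC1 score and a position.
watches = [
    {"id": 1, "score": -1.3, "lat": 44.1, "lon": -60.2},
    {"id": 2, "score": -0.9, "lat": 45.0, "lon": -58.7},
    {"id": 3, "score": 0.1, "lat": 46.3, "lon": -55.1},
    {"id": 4, "score": 0.4, "lat": 47.2, "lon": -52.9},
    {"id": 5, "score": 0.6, "lat": 48.0, "lon": -51.4},
]

ballance_ids = {w["id"] for w in select_watches(watches, 0.4)}   # 'Ballance' cut-off
our_ids = {w["id"] for w in select_watches(watches, -0.9)}       # species-based cut-off
only_ours = sorted(our_ids - ballance_ids)  # watches added by the species-based cut-off

# Hypothetical summed Dovekie counts per PC1 score.
totals = {-1.3: 0, -0.9: 8, 0.1: 0, 0.4: 52}
gaps = gap_scores(totals)
print(only_ours, gaps)
```

Mapping the positions of the watches in `only_ours` would show the additional area covered by the species-based cut-off, and the watches at the scores in `gaps` are the ones that should not be treated as 'absence habitat' without a closer look.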
Acknowledgements
We thank the referee for improvements to the manuscript. FH thanks R. G. B. Brown from the Canadian Wildlife Service for collecting PIROP data over 26 years, and L. Wuest and other ACWERN colleagues, R. J. O'Connor, SPLUS listserv community, T. Lock, F. Cooke, and J. and S. Linke for their ideas, stimulating thoughts and support of this study. This is ACWERN Publication UNB-14.

REFERENCES
Ainley, D. G., Veit, R., Allen, S. G., Spear, L. B. & Pyle, P. (1995) Variations in marine bird communities of the California current, 1986-1994. California Cooperative Oceanic Fisheries Investigations Reports, 36, pp. 72-77.
Alexander, S. & McLaughlin, J. D. (1997) A comparison of the helminth communities in Anas undulata, Anas erythrorhyncha, Anas capensis and Anas smithii at Barberspan, South Africa. Onderstepoort Journal of Veterinary Research, 64, pp. 161-173.
Ballance, L. T., Pitman, R. L. & Reilly, S. B. (1997) Seabird community structure along a productivity gradient: importance of competition and energetic constraint. Ecology, 78, pp. 1502-1518.
Barndorff-Nielsen, O. E., Jensen, J. L. & Kendall, W. S. (Eds) (1993) Networks and Chaos - Statistical and Probabilistic Aspects. Monographs on Statistics and Applied Probability 50 (London, Chapman & Hall).
Bell, J. F. (1996) Application of classification trees to the habitat preference of upland birds. Journal of Applied Statistics, 23, pp. 349-359.
Brown, R. G. B. (1986) Revised Atlas of Eastern Canadian Seabirds (Bedford Institute of Oceanography, Canadian Wildlife Service, Halifax).
Brown, R. G. B., Nettleship, D. N., Germain, P., Tull, C. E. & Davis, T. (1975) Atlas of Eastern Canadian Seabirds (Canadian Wildlife Service, Ottawa).
Brown, S. K., Mahon, R., Zwanenburg, K. C. T., Buja, K. R., Claflin, L. W., O'Boyle, R. N., Atkinson, B., Sinclair, G., Howell, G. & Monaco, M. E. (1996) East Coast of North America Groundfish: Initial explorations of biogeography and species assemblages (Silver Spring, MD: National Oceanic and Atmospheric Administration, and Dartmouth, NS, Department of Fisheries and Oceans).
Camphuysen, C. J., Calvo, B., Durinck, J., Ensor, K., Follestad, A., Furness, R. W., Leaper, G., Skov, H., Tasker, M. L. & Winter, C. J. N. (1995) Consumption of Discards by Seabirds in the North Sea. (Final report EC DG XIV research contract IOECO/93/10. NIOZ-Report 1995-5. Netherlands Institute for Sea Research/Texel).
De Graaf, R. M., Tilghman, N. G. & Anderson, S. H. (1985) Foraging guilds of North American birds. Environmental Management, 9, pp. 493-536.
Diamond, A. W., Gaston, A. J. & Brown, R. G. B. (1993) A model of the energy demands of the seabirds of eastern Arctic Canada. In: W. A. Montevecchi (Ed.), Studies of high-latitude seabirds, Occasional Paper Number 77 (Ottawa, Canadian Wildlife Service).
Garthe, S. & Hueppop, O. (1996) Distribution of ship-following seabirds and their utilization of discards in the North Sea in summer. Marine Ecology Progress, 106, pp. 1-9.
Gaston, A. J. & Jones, I. L. (1998) The Auks (Oxford, Oxford University Press).
Huettmann, F. (1996) Recognizing animal species with Artificial Intelligence (AI) Software on digitized video pictures; an application using roe deer and red fox. In N. Botev, S. Golovatch & L. Penev (Eds) Proceedings of the XXII IUGB Congress, Sofia, Bulgaria, August 1995, pp. 129-138 (International Union of Game Biologists).
Huettmann, F. (2000a) Making use of public large-scale environmental databases from the WWW and a GIS for georeferenced prediction modelling: a research application using Generalized Linear Models, Classification and Regression Trees, and Neural Networks. In Tochtermann, K. & Riekert, W.-F. (Eds) 'Hypermedia im Umweltschutz' Proceedings of 3. Workshop Umweltinformatik. Deutsche Gesellschaft für Informatik (GI) and Forschungsinstitut für anwendungsorientierte Wissensverarbeitung (FAW) Ulm. Umwelt-Informatik aktuell; Metropolis Verlag, Marburg. pp. 281-292.
Huettmann, F. (2000b) Environmental Determination of Seabird Distribution. Unpublished PhD Thesis. University of New Brunswick, Fredericton, 415 p.
Huettmann, F. & Diamond, A. W. (2000) Seabird migration in the Canadian Northwest Atlantic Ocean: moulting locations and movement patterns of immature birds. Canadian Journal for Zoology, 78, pp. 624-647.
Huettmann, F. & Diamond, A. W. (in press) The importance of seabird colony locations in the Northwest Atlantic: towards a spatially explicit seabird distribution model. Ecological Modelling.
Huettmann, F. & Lock, A. R. (1997) A new software system for the PIROP database; data flow and an approach for the seabird-depth analysis. ICES Journal of Marine Science, 54, pp. 518-523.
Jackson, D. A. (1993) Stopping rules in Principal Component Analysis: A comparison of heuristical and statistical approaches. Ecology, 74, pp. 2204-2214.
Kirk, D. A., Diamond, A. W., Hobson, K. A. & Smith, A. R. (1996) Breeding bird communities of the western and northern Canadian boreal forest: relationship to forest type. Canadian Journal for Zoology, 74, pp. 1749-1770.
Lock, A. R., Brown, R. G. B. & Gerriets, S. H. (1994) Gazetteer of Marine Birds in Atlantic Canada. (Halifax, Canadian Wildlife Service).
Miller, T. (1994) Model selection in tree-structured regression. Proceedings of the Statistical Computing Section of the American Statistical Association, pp. 158-163.
Miller, T. & Ribic, C. (1995) Tree-Structured variable selection methods. Proceedings of the Statistical Computing Section of the American Statistical Association, pp. 142-147.
Mills, K. (1998) Multispecies seabird feeding flocks in the Galapagos Islands. Condor, 100, pp. 277-285.
Nettleship, D., Sanger, G. A. & Springer, P. F. (Eds) (1984) Marine birds: their feeding ecology and commercial fisheries relationships. Proceedings of the Pacific Seabird Group Symposium, Seattle, Washington, 6-8 January 1982. Special publication by the Canadian Wildlife Service for the Pacific Seabird Group, Ottawa.
O'Connor, R. & Jones, M. T. (1997) Using hierarchical models to index the ecological health of the nation. 62nd Transactions of the North American Wildlife and Natural Resource Conference, pp. 501-508.
O'Connor, R., Jones, M. T., White, D., Hunsaker, C., Loveland, T., Jones, B. & Preston, E. (1996) Spatial partitioning of environmental correlates of avian biodiversity in the conterminous United States. Biodiversity Letters, 3, pp. 97-110.
Poulin, B., Lefebvre, G. & McNeil, R. (1994) Characteristics of feeding guilds and variation in diets of bird species of three adjacent tropical sites. Biotropica, 26, pp. 187-197.
Powers, K. D. & Brown, R. G. B. (1987) Seabirds. In R. H. Backus & D. W. Bourne (Eds) Georges Bank, pp. 359-371 (Cambridge, MIT Press).
Ribic, C. & Ainley, D. G. (1997) The relationships of seabird assemblages to physical habitat features in Pacific equatorial waters during spring 1984-1991. ICES Journal of Marine Science, 54, pp. 593-599.
Rita, H. & Peuhkuri, N. (1996) Competition in foraging groups. Oikos, 76, pp. 583-586.
Scardi, M. (1996) Artificial neural networks as empirical models for estimating phytoplankton production. Marine Ecology Progress Series, 139, pp. 289-299.
Siegel, C. (1994) Mastering FoxPro 2.6 - Special Edition (San Francisco, Sybex).
Skov, H., Durinck, J., Danielsen, F. & Bloch, D. (1995) Co-occurrence of cetaceans and seabirds in the Northeast Atlantic. Journal of Biogeography, 22, pp. 71-88.
StatSci Division (1996) S-Plus, Guide to Statistical & Mathematical Analysis. (Seattle, Washington, MathSoft).
Stenhouse, I. J. & Montevecchi, W. A. (1996) Winter distribution and wrecks of little auks (dovekies) Alle a. alle in the Northwest Atlantic. Sula, 10, pp. 219-228.
van Tongeren, O. F. R. (1995) Cluster Analysis. In: R. H. G. Jongman, C. J. F. ter Braak & O. F. R. van Tongeren (Eds), Data Analysis in Community and Landscape Ecology, pp. 174-212 (Cambridge, Cambridge University Press).
Venables, W. N. & Ripley, B. D. (1994) Modern Applied Statistics with S-Plus.2 (New York, Springer Verlag).
Wiens, J. A. (1989) The Ecology of Bird Communities. Volume 1. Foundations and patterns (Cambridge, Cambridge University Press).