Int. J. Data Mining and Bioinformatics, Vol. 4, No. 4, 2010

A data mining approach to dinoflagellate clustering according to sterol composition: correlations with evolutionary history

Jeffrey D. Leblond* and Andrew D. Lasiter Department of Biology, Middle Tennessee State University, P.O. Box 60, Murfreesboro, TN 37132, USA E-mail: [email protected] E-mail: [email protected] *Corresponding author

Cen Li Department of Computer Science, Middle Tennessee State University, P.O. Box 48, Murfreesboro, TN 37132, USA E-mail: [email protected]

Ramiro Logares Limnology Department, Uppsala University, Uppsala, SE-75123, Sweden E-mail: [email protected]

Karin Rengefors Limnology Division, Department of Ecology, Lund University, Lund, SE-22362, Sweden E-mail: [email protected]

Terence J. Evens USDA-ARS, United States Horticultural Research Laboratory, 2001 South Rock Rd., Ft. Pierce, FL 34945, USA E-mail: [email protected]

Abstract: This study examined the sterol compositions of 102 dinoflagellates using clustering and cluster validation techniques, as a means of determining the relatedness of the organisms. In addition, dinoflagellate sterol-based relationships were compared statistically to 18S rDNA-based phylogenetic

Copyright © 2010 Inderscience Enterprises Ltd.


relationships using the Mantel test. Our results indicated that the examined dinoflagellates formed six clusters based on sterol composition and that several, but not all, dinoflagellate genera, which formed discrete clusters in the 18S rDNA-based phylogeny, shared similar sterol compositions. This and other correspondences suggest that the sterol compositions of dinoflagellates are explained, to a certain extent, by the evolutionary history of this lineage.

Keywords: bioinformatics; clustering; cluster validation; data mining; dinoflagellate; knowledge discovery; phylogeny analysis; sterol.

Reference to this paper should be made as follows: Leblond, J.D., Lasiter, A.D., Li, C., Logares, R., Rengefors, K. and Evens, T.J. (2010) ‘A data mining approach to dinoflagellate clustering according to sterol composition: correlations with evolutionary history’, Int. J. Data Mining and Bioinformatics, Vol. 4, No. 4, pp.431–451.

Biographical notes: Jeffrey D. Leblond is currently an Associate Professor in the Middle Tennessee State University (MTSU) Department of Biology. He received his BS in Environmental Science from the University of Massachusetts in 1993 and his PhD in Microbiology from the University of Tennessee in 1997. His research interest is algal lipids, with particular emphasis on sterols and chloroplast membrane lipids in dinoflagellates.

Andrew D. Lasiter is currently a Doctoral student in the Integrated Program in Biomedical Sciences at the University of Tennessee Health Science Center. He received his BS in Biology from Middle Tennessee State University (MTSU) in 2007. His research interests are mechanisms of immunomodulation and autoimmune disease etiology.

Cen Li is currently an Associate Professor of the Computer Science Department at MTSU. She received her BS in Computer Science from MTSU in 1993, and her MS and PhD in Computer Science from the Vanderbilt University in 1995 and 2000, respectively. Her research interests are data mining and its applications in bioinformatics and web intelligence, and robotics.

Ramiro Logares is a Post-doctoral Researcher of the Evolutionary Biology Centre, Limnology, at Uppsala University, Sweden. He received his MS in Biological Sciences at Universidad del Comahue, Argentina, in 2003 and his PhD in Limnology at Lund University, Sweden, in 2007. His research interests are microbial evolution, diversity and biogeography.

Karin Rengefors is currently a Professor in Limnology at the Department of Ecology at Lund University, Sweden. She received her BS in Biology from Uppsala University, Sweden in 1991, where she also received her PhD in Limnology in 1998. Her research interest is in phytoplankton ecology, especially life cycles, genetic diversity and biogeography.

Terence J. Evens is currently a Research Ecologist with the US Department of Agriculture, Agricultural Research Service. He received his BA in Natural Sciences from New College of Florida in 1992, and his PhD in Biology from UC Santa Barbara in 1999. His research interests are microalgal physiology, biochemistry and physiological ecology.

1 Introduction

Dinoflagellates are an ecologically important and diverse group of unicellular eukaryotic algae (Hallegraeff, 1993; Kirkpatrick et al., 2004). Photosynthetic dinoflagellates, which represent about half of all dinoflagellate species, are important primary producers in marine and freshwater ecosystems, both as free-living cells and as symbionts within corals and other invertebrate animals. Several dinoflagellate species produce toxins, which can cause enormous economic and environmental losses due to human health impacts (Kirkpatrick et al., 2004) and widespread damage to fisheries and marine mammals (Hallegraeff, 1993).

For decades, dinoflagellates isolated from several locations around the globe have been examined for production of biomarker sterols to be used for tracking these organisms through time and space. Numerous studies have shown that several genera of dinoflagellates produce dinosterol, a 4α-methyl sterol rarely found in other classes of protists (Volkman, 2003). This sterol and others have been considered class-specific (i.e., Dinophyceae) biomarkers. Some studies have shown that certain dinoflagellates produce sterols that have the potential to serve as genus-specific biomarkers (Leblond and Chapman, 2002; Giner et al., 2003). However, there has never been a synthesis of the wealth of dinoflagellate sterol data to determine whether the distribution of sterols across the class Dinophyceae reflects the evolutionary relationships of this class of protists. The objectives of this research were twofold:

• to apply data-driven analyses to identify the relationships among dinoflagellates based strictly on sterol compositions

• to investigate the correspondences between dinoflagellate sterol compositions and their evolutionary histories as revealed by a Bayesian Inference 18S rDNA-based phylogeny.

2 Sterol data description

Fifty-eight named sterols and sterol ketones identified from 102 dinoflagellate species were used to create a database of species-/strain-specific relative sterol compositions. A total of 19 published surveys and a large amount of previously unpublished data (labelled as study T) were used to create the database. The published studies were as follows:

A Alam et al. (1978)
B Kokke et al. (1981)
C Goad and Withers (1982)
D Kokke et al. (1982)
E Jones et al. (1983)
F Alam et al. (1984)
G Nichols et al. (1984)
H Harvey et al. (1988)
I Hallegraeff et al. (1991)
J Piretti et al. (1997)
K Klein Breteler et al. (1999)
L Mansour et al. (1999)
M Volkman et al. (1999)
N Leblond and Chapman (2002)
O Giner et al. (2003)
P Leblond and Chapman (2004)
Q Thomson et al. (2004)
R Leblond et al. (2006a)
S Leblond et al. (2006b).

The complete list of the dinoflagellate species and the sterols is omitted here due to space limitation. The structures of sterols covered in this study are shown in Figure 1.

Figure 1 Structures of sterols in the study¹

3 Clustering and cluster validation on dinoflagellate sterol composition data

Clustering analysis partitions data into homogeneous groups such that within-group similarity and between-group dissimilarity are maximised (Duda et al., 2000). In many science and engineering applications, clustering is the first step in an exploratory data analysis process (Jain et al., 1999). Many clustering algorithms have been developed using different similarity measures, clustering control schemes, and cluster partition selection criteria (Jain et al., 1999; Duda et al., 2000). For many data sets, the clusters generated by different systems do not agree with each other. However, few works have focused on systematic cluster validation that evaluates the quality of the clusters and identifies the optimal clusters for the data. In most real-world applications, clustering analysis is performed by non-specialists who often lack sufficient knowledge about the differences among similarity measures or among clustering systems. Once a set of clusters is generated from a chosen system, it is often accepted without further scrutiny. The analyst then determines how these clusters may be explained using domain knowledge. Some clustering systems produce a hierarchy of clusters, and it is the analyst’s job to decide which clusters in the hierarchy should be extracted to best partition the data.

This problem is studied in this work by applying two different schemes: a bootstrap-based cluster validation that focuses on the statistical significance of the clusters during multi-scale, multi-step bootstrap analysis, and a data mining approach that explicitly evaluates the quality of cluster partitions of different sizes using internal and external criteria.

3.1 Clustering and cluster validation using bootstrap

Suzuki and Shimodaira have extended the uncertainty assessment methods used for phylogeny analysis for the purpose of cluster validation (Shimodaira, 2004; Suzuki and Shimodaira, 2006). To determine the statistical significance of the clusters in a clustering hierarchy, thousands of bootstrap samples are generated by randomly sampling elements of the data, and bootstrap replicates of the dendrogram are derived by repeatedly applying the clustering algorithm to them (Efron et al., 1996). The statistical significance of a cluster derived from these replicates is measured by a p-value that approximates the true probability that the cluster exists. Since the true p-value is computable only in some special cases, different methods have been developed to approximate it.

In this study, Suzuki and Shimodaira’s PvClust package, developed in R (R Development Core Team, 2007; the R project website, http://www.r-project.org/), was applied to study the cluster structure of the sterol data. The approximate p-values computed in PvClust are the Bootstrap Probability (BP) and the Approximately Unbiased probability (AU). BP measures the frequency with which each cluster appears in the replicate dendrograms, and represents a first-order estimate of the p-value. AU, computed as the result of a multi-scale bootstrap re-sampling, represents a higher-order approximation of the p-value, and is considered a better indicator.

The bootstrap analysis was performed by applying PvClust to the sterol data, using the recommended number of bootstraps, nboot = 1000. Figure 2 shows the resulting dendrogram with the AU/BP values marked next to each cluster. In a typical case, the AU/BP values should generally increase from the root of the dendrogram to the leaf clusters because the similarity of the data objects increases towards the leaf clusters. Therefore, a threshold value in AU or BP may be used to determine at which level data objects cease to belong in the same cluster, and thus to derive the statistically significant set of clusters along the cut-off level. For the clusters derived from the sterol data, however, the AU values are equal to 100 along almost every path from the root to the leaf clusters, which suggests no significant clustering in the data. Given that we have a priori knowledge that the sterol composition of these dinoflagellate species is quite different, especially among certain groups, this result leads to the belief that the bootstrap type of analysis is not suitable for the data at hand, most likely because the data set is highly sparse. This result prompted us to look at other approaches for cluster validation.

Figure 2 Cluster validation results on the sterol data using PvClust (AU values are in red, and the BP values are in green) (see online version for colours)
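The Bootstrap Probability described above can be illustrated with a minimal sketch. The code below is not PvClust and does not compute AU values; it is a standard-library Python illustration under stated assumptions: the function names (`upgma_clusters`, `bootstrap_probability`), the toy data, and the choice to resample features (columns) with replacement are all hypothetical choices made for this example.

```python
import random

def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def upgma_clusters(rows):
    """Return the non-trivial clusters (frozensets of row indices) that
    appear in an average-linkage (UPGMA) dendrogram of the rows."""
    n = len(rows)
    dist = {(i, j): euclidean(rows[i], rows[j])
            for i in range(n) for j in range(i + 1, n)}
    active = [frozenset([i]) for i in range(n)]
    found = set()
    while len(active) > 1:
        best = None
        for p in range(len(active)):
            for q in range(p + 1, len(active)):
                a, b = active[p], active[q]
                # Average linkage: mean pairwise distance between members.
                d = sum(dist[(min(i, j), max(i, j))]
                        for i in a for j in b) / (len(a) * len(b))
                if best is None or d < best[0]:
                    best = (d, p, q)
        _, p, q = best
        merged = active[p] | active[q]
        active = [c for k, c in enumerate(active) if k not in (p, q)]
        active.append(merged)
        if len(merged) < n:  # skip the root, which always exists
            found.add(merged)
    return found

def bootstrap_probability(rows, n_boot=200, seed=0):
    """Fraction of bootstrap replicates in which each cluster of the
    original dendrogram reappears with exactly the same members."""
    rng = random.Random(seed)
    n_features = len(rows[0])
    reference = upgma_clusters(rows)
    counts = {c: 0 for c in reference}
    for _ in range(n_boot):
        # Assumption: resample features (columns) with replacement.
        cols = [rng.randrange(n_features) for _ in range(n_features)]
        replicate = [[r[c] for c in cols] for r in rows]
        for c in upgma_clusters(replicate) & reference:
            counts[c] += 1
    return {c: counts[c] / n_boot for c in reference}

# Hypothetical relative compositions for four organisms and three sterols.
toy = [[90, 5, 5], [85, 10, 5], [5, 5, 90], [10, 5, 85]]
bp = bootstrap_probability(toy)
```

On well-separated toy data such as this, most replicates recover the same two groups, so their BP values are close to 1. On sparse data many resampled columns are identical or uninformative, which is one way the support values can become uninformative.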

3.2 Cluster validation using data mining methods

Clustering analysis is one of the most widely used procedures for partitioning data into homogeneous groups. Many different clustering systems have been developed employing different similarity/distance measures, clustering control schemes, and cluster partition selection criteria. Because of the bias introduced through these differences, the clusters generated from different systems often do not agree with each other. Unlike the purely statistical approaches that measure cluster quality based on data re-sampling, the data mining approach evaluates cluster quality using explicit criterion functions. Clustering results generated from different systems may be evaluated at the same time against the same set of functions, and the consensus obtained among different systems may be used for cluster selection. clValid (Brock et al., 2008), an R package for cluster validation, was used for this analysis.

To represent the biases of many different clustering schemes, a group of very different clustering algorithms was employed in this study. The clustering results from these methods were synthesised to determine the final clustering partition. The clustering algorithms used include agglomerative and divisive hierarchical clustering methods, K-means clustering and the related Partitioning Around Medoids (PAM) clustering, mixture model-based clustering, sampling-based clustering, Self-Organising Map (SOM), and Self-Organising Tree Algorithm (SOTA).

The hierarchical clustering (hierarchical) system implements the agglomerative UPGMA algorithm with the Euclidean distance measure. Each observation is initially placed in a cluster by itself. Clusters are successively merged based on their similarity, which is determined from the pre-computed matrix of pairwise similarities between data objects. DIANA implements divisive hierarchical clustering. In this approach, all the data objects are initially placed in the root cluster. The clusters are successively divided into child clusters until each cluster contains a single data object.

K-means clustering performs partitional clustering using Euclidean distance to determine data-to-cluster assignment. The method minimises the mean squared error of the data within each cluster. The iterative clustering process starts with a random set of k cluster seeds. Data are assigned to the nearest cluster, the cluster centres are updated with the new data distribution, and the data are redistributed. This iterative process stops when the set of clusters converges. The PAM method is similar to K-means clustering in that both are iterative, partitional clustering methods. Unlike K-means clustering, PAM represents each cluster centre by the ‘most representative’ data object in the cluster. CLARA is a sampling-based clustering approach, which applies PAM to a number of sub-datasets.
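The K-means iteration described above (random seeds, nearest-centre assignment, centre update, repeat until convergence) can be sketched in a few lines. This is an illustrative standard-library Python version, not the R implementation used through clValid; the function name `kmeans`, the stopping rule (stop when assignments no longer change), and the toy points are assumptions made for this example.

```python
import random

def kmeans(points, k, seed=0, max_iter=100):
    """Plain K-means with squared Euclidean distance.
    Returns (labels, centres)."""
    rng = random.Random(seed)
    # Start from k randomly chosen data points as cluster seeds.
    centres = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centre.
        new_labels = [
            min(range(k),
                key=lambda c: sum((x - y) ** 2
                                  for x, y in zip(p, centres[c])))
            for p in points
        ]
        if new_labels == labels:  # converged: assignments are stable
            break
        labels = new_labels
        # Update step: each centre becomes the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centre if a cluster empties
                centres[c] = [sum(col) / len(members)
                              for col in zip(*members)]
    return labels, centres

# Hypothetical two-dimensional data with two obvious groups.
toy_points = [[0, 0], [0, 1], [10, 10], [10, 11]]
toy_labels, toy_centres = kmeans(toy_points, k=2)
```

PAM differs only in the update step: instead of the mean, it would pick the member that minimises the total distance to the other members of its cluster.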
The model-based clustering method implements the Bayesian hierarchical clustering approach with multivariate mixtures of Gaussians. The Expectation Maximisation (EM) algorithm is used for the iterative data-to-cluster assignment and the Bayesian Information Criterion is used for cluster selection. The SOM method trains a single-layer feed-forward neural network with a competitive learning method. It groups data into clusters according to the closeness of the data, computed from the properties of the data in a reduced dimensional space. The SOTA algorithm clusters data using a divisive hierarchical binary tree structure. Among these clustering methods, hierarchical, K-means, model-based clustering and SOM are the more popular methods.

Each algorithm was run nine times to derive partitions with 2–10 clusters. The quality of the clustering results was compared using the cluster validation measures explained here. The best cluster size was the one that optimised the validation measures (Dunn, 1974; Kaufman and Rousseeuw, 1990; Handl and Knowles, 2005; Handl et al., 2005).

Three types of internal measures were used to compute the correctness of data-to-cluster assignment, the compactness of the clusters, and the separation between clusters. These included the connectivity measure (Handl and Knowles, 2004), the Dunn index (Dunn, 1974), and the Silhouette width index (Kaufman and Rousseeuw, 1990). Connectivity measures the degree of ‘connectedness’ of the clusters in terms of distance among the k-nearest neighbours: the tighter the clusters, the lower the connectivity value. The Dunn index is the ratio of the smallest distance between observations not in the same cluster to the largest intra-cluster distance. The Dunn index should be maximised for the best clustering results. The silhouette width is the average of each observation’s silhouette value and should also be maximised. The silhouette value measures the degree of confidence in the clustering assignment of a particular observation, with well-clustered observations having values near 1 and poorly clustered observations having values near −1.

To measure the stability of the generated clusters, clustering was applied repeatedly to noisy data produced by removing one column/feature at a time from the original data. The average difference between the clusters generated from the noisy data and the original data was computed. The stability measures included the Average Proportion of Non-overlapping (APN) measure, the Average Distance (AD) measure, the Average Distance between Means (ADM) measure and the Figure of Merit (FOM) measure (Handl and Knowles, 2005). The APN measures the average proportion of observations not placed in the same cluster by clustering based on the full data and clustering based on the noisy data. The AD measure computes the average distance between observations placed in the same cluster by clustering based on the full data and clustering based on the noisy data. The ADM measure computes the average distance between cluster centres for observations placed in the same cluster by clustering based on the full data and clustering based on the noisy data. The FOM measures the average intra-cluster variance of the observations in the designated deleted column, where the clustering is based on the remaining (undeleted) samples.
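Two of the internal measures above, the Dunn index and the silhouette width, follow directly from their definitions. The sketch below is a standard-library Python illustration rather than the clValid code; the function names, the convention of scoring singleton clusters as 0, and the toy data are assumptions made for this example. It also assumes at least two clusters and at least one cluster with two or more members.

```python
def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def dunn_index(points, labels):
    """Smallest between-cluster distance divided by the largest
    within-cluster distance (larger is better)."""
    n = len(points)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    between = [euclidean(points[i], points[j])
               for i, j in pairs if labels[i] != labels[j]]
    within = [euclidean(points[i], points[j])
              for i, j in pairs if labels[i] == labels[j]]
    return min(between) / max(within)

def silhouette_width(points, labels):
    """Mean silhouette value over all observations (larger is better)."""
    n = len(points)
    scores = []
    for i in range(n):
        same = [euclidean(points[i], points[j])
                for j in range(n) if j != i and labels[j] == labels[i]]
        if not same:
            scores.append(0.0)  # assumed convention for singletons
            continue
        a = sum(same) / len(same)  # mean distance to own cluster
        b = min(                   # mean distance to nearest other cluster
            sum(euclidean(points[i], points[j])
                for j in range(n) if labels[j] == other)
            / sum(1 for j in range(n) if labels[j] == other)
            for other in set(labels) if other != labels[i]
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / n

# Hypothetical data: the first partition is sensible, the second is not.
pts = [[0, 0], [0, 1], [10, 10], [10, 11]]
good = [0, 0, 1, 1]
bad = [0, 1, 0, 1]
dunn_good = dunn_index(pts, good)
sil_good = silhouette_width(pts, good)
sil_bad = silhouette_width(pts, bad)
```

Evaluating both functions for every algorithm and every k from 2 to 10 yields the kind of criterion-versus-cluster-size curves that are inspected for a ‘knee’.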

4 Phylogenetic constructions

18S rDNA sequences representing most dinoflagellate lineages were downloaded from Genbank. In addition, new sequences were obtained for this study from several species/strains according to the procedures outlined in Rogers et al. (2006). These organisms were selected to represent some of the same dinoflagellate strains (i.e., same culture collection numbers) for which sterol data has been produced, and represent approximately one-third (38) of the 102 dinoflagellates in the sterol composition database. The sequences obtained for this study are (study code and Genbank accession number in parentheses): Adenoides eludens NEPCC 683a (T, EF492484), Akashiwo sanguinea CCMP 1740 (N, EF492486), Amphidinium carterae UTEX 1687 (N, EF492485), Coolia monotis CCMP 1345 (N, EF492487), Coolia sp. (N, EF492488), Fragilidium sp. CCMP 1920 (N, EF492489), Gymnodinium simplex NEPCC 736 (N, EF492490), Gymnodinium sp. CCMP 422 (N, EF492491), Gymnodinium sp. 424 (N, EF492492), Gymnodinium sp. CCMP 425 (EF492493), Gymnodinium sp. 
UTEX 1653 (N, EF492494), Gymnodinium uncatenum NEPCC533 (N, EF492495), Gyrodinium dorsum UTEX 2334 (N, EF492497), Gyrodinium uncatenum NEPCC 607 (N, EF492498), Heterocapsa niei UTEX 1564 (N, EF492499), Heterocapsa pygmaea CCMP 1322 (N, EF492500), Karenia brevis CCMP 718 (O, EF492501), Karenia brevis EPA (N, EF492503), Karenia brevis FMRI (N, EF492504), Karenia brevis NOAA (N, EF492502), Karenia mikimotoi NEPCC 665 (N, EF492505), Karlodinium micrum NEPCC 734 (N, NEPCC734), Kryptoperidinium foliaceum UTEX 1688 (formerly Peridinium foliaceum, EF492508), Lingulodinium polyedra CCMP 1738 (N, EF492507), Peridinium sociale UTEX 1948 (T, EF492509), Prorocentrum mexicanum (N, EF492510), Prorocentrum micans UTEX 1003 (N, EF492511), Prorocentrum triestinum UTEX 1657 (N, EF492512), Scrippsiella trochoidea UTEX 1017 (EF492513), Symbiodinium microadriaticum NEPCC 737 (N, EF492496), Symbiodinium microadriaticum UTEX 2281 (N, EF492514), and Thecadinium inclinatum NEPCC 682 (N, EF492515). In addition to the sequences listed above, the following published


sequences were identified as directly relating to cultures for which both sterol and sequence data derive from the same organism: Alexandrium minutum CCMP 113 (N, AY883006), Alexandrium tamarense UTEX 2521 (N, AY883004, Rogers et al. (2006)), Peridinium aciculiferum PAER1 (S, AY970653), Pfiesteria piscicida CCMP 1834 (P, AY121846), Scrippsiella hangoei SHTV-5 (T, AY970662) and Scrippsiella hangoei SHTV-6 (S, EF417316). These sequences were included in the dinoflagellate phylogenetic constructions, but more importantly, were used in a direct statistical comparison between dinoflagellate sterol and 18S rDNA variation.

After the elimination of identical and apparently erroneous sequences, an alignment was created using ClustalX (v1.8). The alignment was edited using Gblocks (v0.91b), as well as visual analysis, and consisted of 244 sequences and 1528 characters. Phylogenies were estimated using Bayesian Inference (BI) in MrBayes (v3.1.2), which uses a Metropolis-coupled Markov Chain Monte Carlo (MCMC) approach to approximate Bayesian Posterior Probabilities (PPs) (Huelsenbeck and Ronquist, 2001; Altekar et al., 2004). The hierarchical likelihood ratio test (Huelsenbeck and Crandall, 1997) and the Akaike information criterion, as implemented in ModelTest (v3.7) (Posada and Crandall, 1998), indicated that the General Time Reversible (GTR) model of nucleotide substitution, with a Gamma (Γ) distributed rate of variation across sites and a proportion of invariable sites (I), was the most appropriate evolutionary model for our 18S rRNA dataset. The evolutionary model consisted of GTR + Γ + COV. The Covarion (COV) model allows substitution rates to change across positions through time (Huelsenbeck, 2002).

Two Bayesian MCMC analyses were run with seven Markov chains (six heated, one cold) for 5 million generations and the trees were sampled every 100 generations, which resulted in 50,000 sampled trees. Each analysis used default (flat) priors and started from random trees. The obtained PP values for the branching pattern as well as the likelihood scores for the tree reconstruction were compared to ensure convergence. Consensus trees were calculated using the 3 × 10⁴ trees after the log-likelihood stabilisation (burn-in phase). The trees generated with MrBayes were visualised in TreeView (v1.6.6), and further edited in MEGA (ver. 3.1) (Kumar et al., 2004).

5 Combined analyses of sterol- and 18S rDNA-based dinoflagellate relationships

To test whether the differentiation in sterol composition is correlated with 18S rDNA divergence, comparisons were made between these two datasets for 82 dinoflagellates for which sterol and 18S rDNA data exist for the same species. A second comparison was performed for the 38-strain subset of these 82 species for which sterol and 18S rDNA data exist for the same strain. This eliminates any incongruence between the sterol and rDNA datasets that naturally arises from strain-specific inconsistencies. The standardised Mantel coefficient, which is the product-moment correlation between elements of two similarity matrices, derived from the Z-statistics of MANTEL, was computed with the freeware program Manteller (http://dyerlab.bio.vcu.edu/trac/). This procedure is used to estimate the association between two independent similarity matrices and to test whether the association is stronger than would be expected by chance.

Similarity matrices of Euclidean distances for the relative sterol compositions of the species-specific (82 species) and strain-specific (38 strains) datasets were created with Primer (v6). In addition, two genetic distance matrices were derived by computing the uncorrected genetic distances (p) among aligned 18S sequences. The standardised Mantel correlation coefficients between the species-specific and strain-specific sterol matrices and their corresponding genetic distance matrices were then calculated. The significance of the test was evaluated by constructing a null distribution with a Monte Carlo procedure: 1000 permutations of the rows and columns of the sterol Euclidean distance matrix were performed, whereas the genetic distance matrices were kept constant.
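The permutation procedure just described can be sketched as follows. This is a standard-library Python illustration, not the Manteller program; the function names, the one-sided p-value with the +1 correction, and the toy matrices are assumptions made for this example.

```python
import random

def upper_triangle(m, order=None):
    """Off-diagonal upper-triangle entries of a square matrix, optionally
    after applying the same permutation to its rows and columns."""
    n = len(m)
    idx = order if order is not None else list(range(n))
    return [m[idx[i]][idx[j]] for i in range(n) for j in range(i + 1, n)]

def pearson(x, y):
    """Product-moment correlation of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x ** 0.5 * var_y ** 0.5)

def mantel(a, b, n_perm=1000, seed=0):
    """Standardised Mantel coefficient between distance matrices a and b,
    with a one-sided Monte Carlo p-value. Rows and columns of a are
    permuted together; b is kept constant."""
    rng = random.Random(seed)
    fixed = upper_triangle(b)
    observed = pearson(upper_triangle(a), fixed)
    order = list(range(len(a)))
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(order)
        if pearson(upper_triangle(a, order), fixed) >= observed:
            hits += 1
    # +1 counts the observed arrangement as one of the permutations.
    return observed, (hits + 1) / (n_perm + 1)

# Hypothetical example: five taxa placed on a line, with a second
# distance matrix that is an exact rescaling of the first.
positions = [0, 1, 2, 3, 10]
sterol_d = [[abs(p - q) for q in positions] for p in positions]
genetic_d = [[2 * abs(p - q) for q in positions] for p in positions]
r, p_value = mantel(sterol_d, genetic_d)
```

Because the second matrix is a rescaled copy of the first, the coefficient is 1 and few permutations match it; with real sterol and genetic distances the coefficient would be lower and the p-value indicates whether it exceeds chance expectation.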

6 Results

6.1 Relationships of dinoflagellates based on sterol composition

Figure 3 shows the cluster validation results of the sterol composition database using internal measures. The best clustering size, the k value, for the data was determined by looking for the ‘knee’ of the curves (Langan et al., 1998; Handl et al., 2005). In the case where the value of a criterion function increases for a better quality clustering, the knee area refers to the place in the criterion value vs. cluster size plots where a significant increase of the criterion value is observed, which generally levels off (or decreases) after that point. In the case of a minimising criterion function, the ‘knee’ area refers to a significant drop of the criterion value followed by a levelling or slower drop of the values. Identifying the knee area from the plots is not a straightforward matter (cf. Figure 3). The results generated from the different systems agree more on certain criterion measures than others, e.g., there is more consensus among results generated from different systems along the Silhouette index than the Dunn index.

From the plots based on the connectivity measure, it was observed that SOM, the model-based clustering, and DIANA have a ‘knee’ area corresponding to cluster size six. After that, the curves levelled off. These three plots were extracted to produce a focused study plot (labelled as such in Figure 3). Results from the other approaches were ignored because they did not display a pronounced ‘knee’ area.²

With the Dunn index, results from the K-means, SOTA, and CLARA methods pointed to a best k value of six, as shown in the focused plot. The plot from PAM suggests a k value of seven. The Dunn index values from clustering results obtained from SOM suggest k = 8. The plots from DIANA and the model-based method show flat Dunn index values for all partition sizes, and were ignored.

Figure 3 Cluster validation on sterol data using the internal measures (see online version for colours)

The results from the Silhouette index showed strong evidence of k = 6 from the hierarchical, the K-means, and DIANA clustering methods. DIANA and SOTA both point to k = 5 as a better alternative. Results from SOM have a ‘knee’ close to k = 3. The model-based method was ignored (no knee is observed in the curve; the silhouette value continues to increase after an initial dip at four). Among the three internal measures, the Silhouette measure is considered to have more authoritative value than the other two (Brock et al., 2008). If maximising the Silhouette measure were the only criterion for choosing the best cluster partition, k = 6 would be the answer given all the cluster measurements. This also agrees with the general consensus as discussed above.

The cluster validation results using the stability measures are shown in Figure 4. Here, we looked for the cluster size, k, which gives the most stable clustering results. Since all four stability measures should be minimised for best quality clustering, it is important to look for major drops in the value of the stability measures. One notable observation was that, for the hierarchical clustering method, the identified ‘knee’ area consistently pointed to a cluster size of six with all four stability measures. The model-based clustering indicated k = 4 or k = 6, with stronger evidence of k = 6 observed in the plots from the FOM and the AD measures. With K-means clustering, the results indicated k = 5 with the APN, the ADM, and the AD measures, and k = 6 with the FOM measure (the FOM measure is often considered a stronger indicator of cluster stability than the AD, ADM, and APN measures). In addition, both CLARA and DIANA showed strong evidence that a cluster number of six minimises the APN measure. For these reasons, we settled on k = 6 as the correct cluster size.

Figure 4 Cluster validation on sterol data using the external measures (see online version for colours)

Figure 5 Dendrogram of dinoflagellate relationships based on sterol compositions accompanied by heat map showing sterol distributions. Species names in red are strains for which there were both 18S and sterol data (i.e., the 38 organisms with strain-specific data)¹ (see online version for colours)

The UPGMA dendrogram of the six clusters formed based on the sterol profiles is shown in Figure 5. The accompanying heat map illustrates how sterol composition is reflected in these clusters. The structures of the sterols were used to label the clusters and are summarised here:

1 Ring system XIV dinoflagellates A: Contained all examined members of Karenia and Karlodinium. (24R)-4α-Methyl-5α-ergosta-8(14),22-dien-3β-ol (XIVh) and 27-nor-(24R)-4α-methyl-5α-ergosta-8(14),22-dien-3β-ol (XIVs) were the predominant sterols, with Karlodinium micrum and Karenia mikimotoi having a greater relative percentage of XIVh than Karenia brevis.

2 Ring system XIV dinoflagellates B: Contained Amphidinium corpulentum and Amphidinium carterae. 4α-Methyl-5α-cholest-8(14)-en-3β-ol (XIVa) and 4α-methyl-5α-ergosta-8(14),24(28)-dien-3β-ol (XIVb), and to a lesser extent 4α,23,24-trimethylcholesta-5α-cholesta-8(14),22-dien-3β-ol (XIVk), were the predominant sterols.

3 Ring systems II and VII dinoflagellates A: Contained Polarella glacialis, Protoceratium reticulatum, Lingulodinium polyedrum, Gymnodinium simplex, and Gymnodinium sp. Cholesta-5,22Z-dien-3β-ol (IIf), 24-methylcholesta-5,22E-dien-3β-ol, and 4α,24-dimethyl-5α-cholestan-3β-ol (VIIc) were the predominant sterols.

4 Ring systems I and VII dinoflagellates: Contained Akashiwo sanguinea. 24-Methyl-5α-cholest-22E-en-3β-ol (Ih), 23,24-dimethyl-5α-cholest-22E-en-3β-ol (Ik), and VIIc were the predominant sterols.

5 Ring systems II and VII dinoflagellates B: Contained several dinoflagellate genera, including, but not limited to, a number of species from Alexandrium, Prorocentrum, and Symbiodinium. Cholest-5-en-3β-ol (IIa) and 4α,23,24-trimethyl-5α-cholest-22E-en-3β-ol (VIIk) were the predominant sterols.

6 Ring system VII dinoflagellates: Also contained several dinoflagellate genera, including, but not limited to, a number of species from Alexandrium, Gymnodinium, Heterocapsa, Pfiesteria, Prorocentrum, Pyrocystis, and Thoracosphaera. Sterols VIIc, 4α,23,24-trimethyl-5α-cholestan-3β-ol (VIIj), and VIIk were the predominant sterols. The presence of species from some of these genera in Cluster 5 as well is addressed in the Discussion.

6.2 Comparison to dinoflagellate groups formed in 18S rRNA phylogenies In order to provide a basis for comparison of sterol-based dinoflagellate groups with genetic-based phylogenies, a phylogenetic tree was created using 244 dinoflagellate 18S rDNA sequences (Figure 6). Because of the separate and heterogeneous origins of the 18S rDNA and sterol databases, the same species (and henceforth strains) were not consistently found in both dendrograms. However, an overall visual comparison of this tree to the dendrogram of 102 dinoflagellates based on sterol composition in Figure 5 shows that several species belonging to genera supported by molecular phylogenies shared similar sterol compositions. The six clusters corresponded to the 18S rDNA phylogeny in the following obvious ways: 1

Karenia and Karlodinium, the two genera found in Cluster 1, grouped near each other with high bootstrap support in the 18S rDNA phylogeny.

2

The Karenia/Karlodinium branch in Figure 5 also included other gymnodinoid dinoflagellates, including species of Amphidinium (see below).

3

With a few exceptions, Amphidinium, the genus representing Cluster 2, formed a tight group with high bootstrap support within the 18S rDNA phylogeny that included no other taxa.


4

The dinoflagellates of Cluster 3 represented organisms that did not form a unified group within the 18S rDNA phylogeny. Within the 18S rDNA phylogeny, Polarella glacialis and Gymnodinium simplex were grouped together with high bootstrap support along with several species of the genus Symbiodinium (Cluster 5). Lingulodinium and Protoceratium, the other two genera of Cluster 3, were located elsewhere in the 18S rDNA phylogeny.

5

Akashiwo, the only genus represented in Cluster 4, formed its own group with high bootstrap support in the 18S rDNA phylogeny.

6

Cluster 5 contained a diverse array of dinoflagellate taxa. Within the 18S rDNA phylogeny, these taxa formed the following groups, some with high bootstrap support:

a

a group that contained several species of Alexandrium and Coolia monotis

b

a group that contained species of Thecadinium along with other taxa (note that one species of Thecadinium, T. dragescoi, was separated from the others in the 18S rDNA phylogeny)

c

a group that contained several species within the genus Symbiodinium

d

a group that contained Prorocentrum mexicanum, Prorocentrum micans, and Prorocentrum triestinum (note that this group also contained organisms from Cluster 6)

e

Adenoides eludens, which did not group closely with any other taxa from Cluster 5 of the sterol dendrogram

f

the genus Amoebophrya, which grouped near Kryptoperidinium foliaceum.

7

Cluster 6 also contained a diverse array of dinoflagellate taxa. They related to the 18S rDNA dendrogram in the following ways:

a

a group that contained several species of Heterocapsa along with two Gymnodinium sp.

b

a group that contained several species of Scrippsiella along with Peridinium aciculiferum and Thoracosphaera heimii

c

a group that contained Gyrodinium dorsum, Gyrodinium uncatenum, and Gymnodinium uncatenum.

To confirm the sterol-18S rDNA dinoflagellate relationships by statistical, rather than visual, means, the Mantel test was performed using selected taxa from the two databases. The Mantel test supported the correspondence between the sterol-based and SSU rDNA-based relationships by showing that both the species-specific (82 species) and the strain-specific (38 strains) sterol matrices were significantly correlated with their corresponding 18S rDNA matrices (r = 0.154, p = 0.031 and r = 0.261, p = 0.008, respectively).
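The logic of a Mantel test can be illustrated with a minimal sketch. It assumes a one-sided permutation test on the Pearson correlation between the upper triangles of two distance matrices; the function names, the toy positions, and the number of permutations are hypothetical choices made for illustration and are not the software or settings used in this study.

```python
import math
import random


def upper_triangle(matrix):
    """Flatten the entries above the diagonal of a square matrix."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def mantel(d1, d2, permutations=999, seed=0):
    """One-sided Mantel test: correlation of d1 with d2, with a p-value
    obtained by permuting the rows and columns of d2 together."""
    n = len(d1)
    rng = random.Random(seed)
    x = upper_triangle(d1)
    observed = pearson(x, upper_triangle(d2))
    hits = 0
    for _ in range(permutations):
        order = list(range(n))
        rng.shuffle(order)
        permuted = [[d2[order[i]][order[j]] for j in range(n)] for i in range(n)]
        if pearson(x, upper_triangle(permuted)) >= observed:
            hits += 1
    # The observed arrangement counts as one of the permutations.
    return observed, (hits + 1) / (permutations + 1)


# Hypothetical toy data: six taxa placed on a line, so that the "gene"
# distances are simply three times the "sterol" distances.
positions = [0.0, 1.0, 2.5, 4.5, 7.0, 10.0]
sterol_dist = [[abs(a - b) for b in positions] for a in positions]
gene_dist = [[3.0 * d for d in row] for row in sterol_dist]

r, p = mantel(sterol_dist, gene_dist)
print(f"r = {r:.3f}, p = {p:.3f}")
```

Rows and columns are permuted together because the entries of a distance matrix are not independent observations, which is why an ordinary significance test on the correlation coefficient would not be appropriate here.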

Figure 6 Consensus SSU rDNA Bayesian phylogeny constructed from an alignment with 244 sequences and 1528 characters under the GTR + Γ + COV model. Species names in red are strains for which there were both 18S and sterol data (i.e., the 38 organisms with strain-specific data)1 (see online version for colours)

7 Discussion

The study of dinoflagellate sterols has a long and rich history, but it is somewhat disjointed because a comprehensive summary of dinoflagellate relationships according to sterol composition has not been performed. Our study has utilised clustering techniques to discover a six-cluster partition of the dinoflagellates (Figures 2–4) based on sterol composition.

The separation of dinoflagellates into Clusters 1 and 2 was primarily due to the possession of a ∆8(14) nuclear unsaturation (ring system XIV) not widely distributed amongst the other sterol clusters. The presence of Amphidinium massartii within Cluster 5 may reveal a misidentification of this organism (Figure 5), or some type of heterogeneity in sterol production within the genus Amphidinium, which appears to be polyphyletic according to our and other rDNA phylogenies (see below). This, however, does not lessen the importance of the fact that organisms within Clusters 1 and 2 were separated from all other organisms primarily because of their ∆8(14) nuclear unsaturations. The differentiation between Cluster 1, containing the genera Karenia and Karlodinium, and Cluster 2, containing several members of the genus Amphidinium, was primarily due to differing side chains, and not because of differing nuclear unsaturation patterns. The presence of the ∆8(14) nuclear unsaturation within Karenia and Karlodinium is most likely due to a common ancestor, since several phylogenies, including the one generated for this study, indicate that Karenia and Karlodinium are closely related (Yoon et al., 2002; Flo Jorgensen et al., 2004; Saldarriaga et al., 2004). The presence of ∆8(14) within Amphidinium requires a deeper examination since the genus Amphidinium (as defined by Claparède and Lachmann (1859)) is most probably polyphyletic (Flo Jorgensen et al., 2004). In some rDNA phylogenies, Karenia, Karlodinium, and some Amphidinium species appear related (Saldarriaga et al., 2004).
However, the Amphidinium species that clustered with Karenia/Karlodinium based on sterol composition do not appear related to Karenia/Karlodinium in rDNA-based phylogenies (Flo Jorgensen et al., 2004; Saldarriaga et al., 2004). Thus, more information will be necessary to ascertain whether the presence of ∆8(14) within Karenia/Karlodinium and some Amphidinium species is due to convergent evolution or a common ancestor. The separation of Karenia and Karlodinium from all other dinoflagellates correlates with the results obtained in our and others' rDNA phylogenies, where Karenia and Karlodinium branch at the base of dinoflagellates. This strongly suggests an early evolutionary divergence for this group (Yoon et al., 2002; Saldarriaga et al., 2004).

In contrast to Clusters 1 and 2, which are separated based on the relative makeup of unusual sterols, the remaining groups (Clusters 3–6) aggregated due to differences in the relative percentages of more common dinoflagellate sterols. For example, 24-methyl-5α-cholestan-3β-ol (Ic), 23,24-dimethyl-5α-cholest-22E-en-3β-ol (Ik), and 4α,24-dimethyl-5α-cholestan-3β-ol (VIIc) were found not only in A. sanguinea (Cluster 4), but also in a number of other dinoflagellates within Clusters 5 and 6 that are not closely related based on 18S rDNA-based phylogeny (Figures 4 and 5). A similar type of widespread sterol distribution also occurred for cholesta-5,22Z-dien-3β-ol (cis-22-dehydrocholesterol, IIf) and 24-methylcholesta-5,22E-dien-3β-ol (IIh). These represent the dominant sterols in Gymnodinium simplex, the closely related Gymnodinium sp., and Polarella glacialis in Cluster 3 (Figure 5), and were also found as minor components in a number of other distantly related dinoflagellates in Clusters 5 and 6.


Interestingly, this group of dinoflagellates is well supported in our and others' rDNA-based phylogenies and probably corresponds to the Suessiales, an ancient dinoflagellate lineage (Saldarriaga et al., 2004). It is noteworthy that Symbiodinium species did not appear within Cluster 3, although they grouped with these taxa in our phylogeny and in other studies (Saldarriaga et al., 2004). In addition, no known members of the Suessiales group produce dinosterol, one of the most common dinoflagellate sterols.

Another clearly defined subgroup of Cluster 5 consisted mostly of species belonging to the order Gonyaulacales (i.e., Alexandrium, Gonyaulax, Coolia), a grouping that is also supported by Saldarriaga et al. (2004). As mentioned earlier, Alexandrium, Gonyaulax and Coolia possess cholesterol (IIa) and dinosterol (VIIk) as their primary sterols (Figure 5). Of particular note are the two isolates of A. tamarense (Cluster 6) examined in study J that possessed very high proportions of dinosterol and did not appear to produce cholesterol.

In the sterol-based cluster analyses, Prorocentrum was separated into two main groups, with the first group containing the species P. hoffmannianum, P. mexicanum, P. micans, and P. triestinum in Cluster 5, and the second group containing P. balticum and P. minimum in Cluster 6 (Figure 5). In general, all of these species of Prorocentrum produce 23,24-dimethylcholesta-5,22-dien-3β-ol (IIk), dinosterol (VIIk), and dinostanol (VIIj), a saturated form of dinosterol. However, the Cluster 5 group produces cholesterol while the Cluster 6 group does not. Conversely, the Cluster 6 group produces 24-methylcholesta-5,24(28)-dien-3β-ol (24-methylenecholesterol, IIb) while the Cluster 5 group does not. In our 18S rDNA phylogenetic analysis (Figure 6), all Prorocentrum species were clustered together. However, the monophyly of Prorocentrum is under discussion (Saldarriaga et al., 2004; Zhang et al., 2007).
Thus, the separation of Prorocentrum into two groups based on sterol composition might reflect the potential polyphyletic status of this genus.

In the 18S rDNA phylogenetic analyses, Pyrocystis lunula and P. noctiluca clustered together (Figure 6). However, the same two species segregated into two clusters based on sterol composition (Figure 5). This separation was primarily due to the presence of high levels of dinosterol in P. noctiluca (Cluster 6) and its absence in P. lunula (Cluster 5), and to high levels of cholesterol and 4α,24-dimethyl-5α-cholestan-3β-ol (VIIc) in P. lunula with their relative absence in P. noctiluca. The sterols of Pyrocystis fusiformis mirrored those of P. noctiluca, which led to both species appearing in sterol Cluster 6 (Figure 5).

All species belonging to the genus Heterocapsa clustered together in Cluster 6 (Figure 5), which agrees with our 18S rDNA phylogenetic analyses (Figure 6). The species Peridinium willei did not cluster with any of the other analysed Peridinium species in the sterol analyses, primarily because P. willei possessed 4α,23,24-trimethyl-5α-cholest-24(28)-en-3β-ol (VIIm) and 4α,23,24-trimethylcholesta-5,22E-dien-3β-ol (dehydrodinosterol, Xk) as primary sterols. A similar pattern was observed for P. willei in our rDNA phylogeny (Figure 6), where this and other related Peridinium species formed a highly supported cluster that segregated from all the other analysed dinoflagellate lineages. Logares et al. (2007) suggested that P. willei and related Peridinium species diverged long ago from other dinoflagellate lineages.


Acknowledgements

Sterol data contributed by Jeremy Dahmen were greatly appreciated. Internal MTSU grants supported portions of this work. Mention of trade names or commercial products in this paper is solely for the purpose of providing specific information and does not imply recommendation or endorsement by the US Department of Agriculture. Phylogenetic analyses were carried out at the University of Oslo Bioportal (http://www.bioportal.uio.no/). Financial support to RL was provided by a Lund University PhD position and the MICROBIOMICS consortium (http://www.microbiomics.se/).

References

Alam, M., Sanduja, R., Watson, D.A. and Loeblich III, A.R. (1984) ‘Sterol distribution in the genus Heterocapsa (Pyrrophyta)’, J. Phycol., Vol. 20, pp.331–335.
Alam, M., Sansing, T.B., Busby, E.L., Martiniz, D.R. and Ray, S.M. (1978) ‘Dinoflagellate sterols I: sterol composition of the dinoflagellates of Gonyaulax species’, Steroids, Vol. 33, pp.197–203.
Altekar, G., Dwarkadas, S., Huelsenbeck, J.P. and Ronquist, F. (2004) ‘Parallel metropolis coupled Markov chain Monte Carlo for Bayesian phylogenetic inference’, Bioinformatics, Vol. 20, pp.407–415.
Brock, G., Pihur, V., Datta, S. and Datta, S. (2008) ‘clValid: an R package for cluster validation’, Journal of Statistical Software, Vol. 25, No. 4, pp.1–22.
Claparède, E. and Lachmann, J. (1859) ‘Études sur les infusoires et les rhizopodes’, Mém. Inst. Nat. Génèvois., Vol. 6, pp.261–482.
Duda, R.O., Hart, P.E. and Stork, D.G. (2000) Pattern Classification, 2nd ed., John Wiley & Sons, Hoboken, NJ, USA.
Dunn, J.C. (1974) ‘Well separated clusters and optimal fuzzy partitions’, J. Cybernetics, Vol. 4, pp.95–104.
Efron, B., Halloran, E. and Holmes, S. (1996) ‘Bootstrap confidence levels for phylogenetic trees’, Proc. Natl. Acad. Sci., USA, Vol. 93, pp.13429–13434.
Flo Jorgensen, M., Murray, S. and Daugbjerg, N. (2004) ‘Amphidinium revisited, I. Redefinition of Amphidinium (Dinophyceae) based on cladistic and molecular phylogenetic analyses’, J. Phycol., Vol. 40, pp.351–365.
Giner, J-L., Faraldos, J.A. and Boyer, G.L. (2003) ‘Novel sterols of the toxic dinoflagellate Karenia brevis (Dinophyceae): a defensive function for unusual marine sterols?’, J. Phycol., Vol. 39, pp.315–319.
Goad, L.J. and Withers, N. (1982) ‘Identification of 27-nor-(24R)-24-methylcholesta-5,22-dien-3β-ol and brassicasterol as the major sterols of the marine dinoflagellate Gymnodinium simplex’, Lipids, Vol. 17, pp.853–858.
Hallegraeff, G.M. (1993) ‘A review of harmful algal blooms and their apparent global increase’, Phycologia, Vol. 32, pp.79–99.
Hallegraeff, G.M., Nichols, P.D., Volkman, J.K., Blackburn, S.I. and Everitt, D.A. (1991) ‘Pigments, fatty acids, and sterols of the toxic dinoflagellate Gymnodinium catenatum’, J. Phycol., Vol. 27, pp.591–599.
Handl, J. and Knowles, J. (2005) ‘Exploiting the trade-off – the benefits of multiple objectives in data clustering’, in Coello, L.A., Hernández, A.A. and Zitzler, E. (Eds.): Proceedings of the Third International Conference on Evolutionary Multicriterion Optimization, Springer-Verlag, Berlin, pp.547–560.
Handl, J., Knowles, J. and Kell, D.B. (2005) ‘Computational cluster validation in post-genomic data analysis’, Bioinformatics, Vol. 21, pp.3201–3212.
Harvey, H.R., Bradshaw, S.A., O’Hara, S.C.M., Eglinton, G. and Corner, E.D.S. (1988) ‘Lipid composition of the marine dinoflagellate Scrippsiella trochoidea’, Phytochemistry, Vol. 27, pp.1723–1729.
Huelsenbeck, J.P. (2002) ‘Testing a covariotide model of DNA substitution’, Mol. Biol. Evol., Vol. 19, pp.698–707.
Huelsenbeck, J.P. and Crandall, K.A. (1997) ‘Phylogeny estimation and hypothesis testing using maximum likelihood’, Ann. Rev. Ecol. Syst., Vol. 28, pp.437–466.
Huelsenbeck, J.P. and Ronquist, F. (2001) ‘MRBAYES: Bayesian inference of phylogenetic trees’, Bioinformatics, Vol. 17, pp.754–755.
Jain, A.K., Murty, M.N. and Flynn, P.J. (1999) ‘Data clustering: a review’, ACM Computing Surveys, Vol. 31, pp.264–323.
Jones, G.J., Nichols, P.D. and Johns, R.B. (1983) ‘The lipid composition of Thoracosphaera heimii: evidence for inclusion in the Dinophyceae’, J. Phycol., Vol. 19, pp.416–420.
Jones, G.J., Nichols, P.D. and Shaw, P.M. (1994) ‘Analysis of microbial sterols and hopanoids’, in Goodfellow, M. and O’Donnel, A.G. (Eds.): Chemical Methods in Prokaryotic Systematics, John Wiley & Sons, New York, pp.163–295.
Kaufman, L. and Rousseeuw, P.J. (1990) Finding Groups in Data: An Introduction to Cluster Analysis, John Wiley & Sons, Hoboken, NJ, USA.
Kirkpatrick, B., Fleming, L.A., Squicciarini, D., Backer, L.C., Clark, R., Abraham, W., Benson, J., Chenge, Y.S., Johnson, D., Pierce, R., Zaias, J., Bossart, G.D. and Baden, D.G. (2004) ‘Literature review of Florida red tide: implications for human health effects’, Harmful Algae, Vol. 3, pp.99–115.
Klein Breteler, W.C.M., Schogt, N., Baas, M., Schouten, S. and Kraay, G.W. (1999) ‘Trophic upgrading of food quality by protozoans enhancing copepod growth: role of essential lipids’, Mar. Biol., Vol. 135, pp.191–198.
Kokke, W.C.M.C., Fenical, W. and Djerassi, C. (1981) ‘Sterols with unusual nuclear unsaturation from three cultured marine dinoflagellates’, Phytochemistry, Vol. 20, pp.127–134.
Kokke, W.C.M.C., Fenical, W. and Djerassi, C. (1982) ‘Sterols of the cultured dinoflagellate Pyrocystis lunula’, Steroids, Vol. 40, pp.307–318.
Kumar, S., Tamura, K. and Nei, M. (2004) ‘MEGA3: integrated software for molecular evolutionary genetics analysis and sequence alignments’, Briefings in Bioinformatics, Vol. 5, pp.150–163.
Langan, D.A., Modestino, J.W. and Zhang, J. (1998) ‘Cluster validation for unsupervised stochastic model-based image segmentation’, IEEE Transactions on Image Processing, Vol. 7, pp.180–195.
Leblond, J.D. and Chapman, P.J. (2002) ‘A survey of the sterol composition of the marine dinoflagellates Karenia brevis, Karenia mikimotoi, and Karlodinium micrum: distribution of sterols within other members of the class Dinophyceae’, J. Phycol., Vol. 38, pp.670–682.
Leblond, J.D. and Chapman, P.J. (2004) ‘Sterols of the heterotrophic dinoflagellate, Pfiesteria piscicida (Dinophyceae): Is there a lipid biomarker?’, J. Phycol., Vol. 40, pp.104–111.
Leblond, J.D., Anderson, B., Kofink, D., Logares, R., Rengefors, K. and Kremp, A. (2006a) ‘Fatty acid and sterol composition of two evolutionarily closely related dinoflagellate morphospecies from cold Scandinavian brackish and freshwaters’, Eur. J. Phycol., Vol. 41, pp.303–311.
Leblond, J.D., Sengco, M.R., Sickman, J.O., Dahmen, J.D. and Anderson, D.A. (2006b) ‘Sterols of the syndinian dinoflagellate, Amoebophrya sp., a parasite of the dinoflagellate Alexandrium tamarense (Dinophyceae)’, J. Eukaryot. Microbiol., Vol. 53, pp.211–216.
Logares, R., Shalchian-Tabrizi, K., Boltovskoy, A. and Rengefors, K. (2007) ‘Extensive dinoflagellate phylogenies indicate infrequent marine-freshwater transitions’, Mol. Phyl. Evol., Vol. 45, pp.887–903.
Mansour, M.P., Volkman, J.K., Jackson, A.E. and Blackburn, S.I. (1999) ‘The fatty acid and sterol composition of five marine dinoflagellates’, J. Phycol., Vol. 35, pp.710–720.
Nichols, P.D., Jones, G.J., de Leeuw, J.W. and Johns, R.B. (1984) ‘The fatty acid and sterol composition of two marine dinoflagellates’, Phytochemistry, Vol. 23, pp.1043–1047.
Piretti, M.V., Pagliuca, G., Boni, L., Pistocchi, R., Diamante, M. and Gazzotti, T. (1997) ‘Investigation of 4-methyl sterols from cultured dinoflagellate algal strains’, J. Phycol., Vol. 33, pp.61–67.
Posada, D. and Crandall, K.A. (1998) ‘Modeltest: testing the model of DNA substitution’, Bioinformatics, Vol. 14, pp.817–818.
R Development Core Team (2007) R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing, Vienna, Austria, ISBN 3-900051-07-0, URL http://www.R-project.org
Rogers, J.E., Leblond, J.D. and Moncreiff, C.A. (2006) ‘Phylogenetic relationship of Alexandrium monilatum (Dinophyceae) to other Alexandrium species based on 18S ribosomal RNA gene sequences’, Harmful Algae, Vol. 5, pp.275–280.
Saldarriaga, J.F., Taylor, F.J.R.M., Cavalier-Smith, T., Menden-Deuer, S. and Keeling, P.J. (2004) ‘Molecular data and the evolutionary history of dinoflagellates’, Eur. J. Protistol., Vol. 40, pp.85–111.
Shimodaira, H. (2004) ‘Approximately unbiased tests of regions using multistep-multiscale bootstrap resampling’, The Annals of Statistics, Vol. 32, No. 6, pp.2616–2641.
Suzuki, R. and Shimodaira, H. (2006) ‘Pvclust: an R package for assessing the uncertainty in hierarchical clustering’, Bioinformatics Applications Note, Vol. 22, No. 12, pp.1540–1542.
Thomson, P.G., Wright, S.W., Bolch, C.J.S., Nichols, P.D., Skerratt, J.H. and McMinn, A. (2004) ‘Antarctic distribution, pigment and lipid composition, and molecular identification of the brine dinoflagellate Polarella glacialis (Dinophyceae)’, J. Phycol., Vol. 40, pp.867–873.
Volkman, J.K. (2003) ‘Sterols in microorganisms’, Appl. Microbiol. Biotechnol., Vol. 60, pp.495–506.
Volkman, J.K., Rijpstra, W.I.C., de Leeuw, J.W., Mansour, M.P., Jackson, A.E. and Blackburn, S.I. (1999) ‘Sterols of four dinoflagellates from the genus Prorocentrum’, Phytochemistry, Vol. 52, pp.659–668.
Yoon, H.S., Hackett, J.D. and Bhattacharya, D. (2002) ‘A single origin of the peridinin- and fucoxanthin-containing plastids in dinoflagellates through tertiary endosymbiosis’, Proc. Natl. Acad. Sci., USA, Vol. 99, pp.11724–11729.
Zhang, H., Bhattacharya, D. and Lin, S. (2007) ‘A three-gene dinoflagellate phylogeny suggests monophyly of prorocentrales and a basal position for Amphidinium and Heterocapsa’, J. Mol. Evol., Vol. 65, pp.463–474.

Notes

1

Enlarged versions of figures can be obtained from J. Leblond. The identification codes for sterol structures follow from the system used by Jones et al. (1994).

2

The average link agglomerative method is not designed to connect data that are ‘close’; on the other hand, the single link clustering method is designed for this objective.
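The difference between the two linkage criteria in Note 2 can be seen in a minimal sketch. The one-dimensional points below are hypothetical values chosen so that each point lies close to its neighbour while the two ends of the chain lie far apart; they are not sterol data, and the helper function is a plain illustration rather than the clustering software used in this study.

```python
from itertools import combinations


def agglomerate(points, k, linkage):
    """Agglomerative clustering of 1-D points down to k clusters.

    linkage is 'single' (closest pair between clusters) or
    'average' (mean of all pairwise distances between clusters).
    """
    clusters = [[p] for p in points]

    def dist(a, b):
        pairwise = [abs(x - y) for x in a for y in b]
        if linkage == "single":
            return min(pairwise)
        return sum(pairwise) / len(pairwise)

    while len(clusters) > k:
        # Find the two clusters separated by the smallest linkage distance.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: dist(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i is unaffected
    return sorted(sorted(c) for c in clusters)


# Hypothetical chain of points with slowly growing gaps.
points = [0.0, 1.0, 2.1, 3.3, 4.6, 6.0, 7.9]

single = agglomerate(points, 2, "single")
average = agglomerate(points, 2, "average")
print(single)   # [[0.0, 1.0, 2.1, 3.3, 4.6, 6.0], [7.9]]
print(average)  # [[0.0, 1.0, 2.1, 3.3], [4.6, 6.0, 7.9]]
```

Single linkage follows the chain of neighbouring points and leaves only the last point apart, whereas average linkage splits the chain into two more compact groups.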