Ecological Informatics 42 (2017) 46–54
Consensus methods based on machine learning techniques for marine phytoplankton presence–absence prediction
M. Bourel a,b, C. Crisci c,*, A. Martínez d

a Instituto de Matemática y Estadística Prof. Ing. Rafael Laguardia, Facultad de Ingeniería, Julio Herrera y Reissig 565, CP 11200 Montevideo, Uruguay
b Departamento Métodos Matemático Cuantitativos, Facultad de Ciencias Económicas y Administración, Universidad de la República, Av. Gonzalo Ramírez 1926, CP 11200 Montevideo, Uruguay
c Centro Universitario Regional del Este, Universidad de la República, Ruta Nacional n9 y Ruta n15, CP 27000 Rocha, Uruguay
d Dirección Nacional de Recursos Acuáticos, M.G.A.P., Puerto de La Paloma, CP 27001 La Paloma, Rocha, Uruguay
Keywords: marine phytoplankton; presence–absence data; machine learning; non-homogeneous consensus methods; prediction

Abstract
We built different consensus methods by combining binary classifiers, mostly machine learning classifiers, with the aim of testing their capability as predictive tools for the presence–absence of marine phytoplankton species. The consensus methods were constructed by combining four methods (i.e., generalized linear models, random forests, boosting, and support vector machines). Six different consensus methods, corresponding to six different ways of combining single-model predictions, were analyzed; some of these methods are presented here for the first time. To evaluate the performance of the models, we considered eight phytoplankton species presence–absence data sets together with data on environmental variables. Some of the analyzed species are toxic, whereas others provoke water discoloration, which can cause alarm in the population. Besides the phytoplankton data sets, we tested the models on 10 well-known open access data sets. We evaluated the models' performances over a test sample. For most (72%) of the data sets, a consensus method was the method with the lowest classification error. In particular, a consensus method that weights single-model predictions according to single-model performance (weighted average prediction error — WA-PE model) was the one that presented the lowest classification error most of the time. For the phytoplankton species, the errors of the WA-PE model were between 10% for the species Akashiwo sanguinea and 38% for Dinophysis acuminata. This study provides novel approaches to improve the prediction accuracy in species distribution studies and, in particular, in those concerning marine phytoplankton species.
1. Introduction

1.1. A brief introduction to consensus methods

In the classification framework of machine learning (ML), ensemble methods or aggregating methods consist in combining the predictions of several classifiers (also called hypotheses or base classifiers) that are performed over the same data set. The predictions are combined with the main goal of reducing variance and constructing a more stable and accurate predictor (James et al., 2014; Hastie et al., 2001; Bourel, 2012, 2013). Ensemble methods have had great success not only in the ML community, but also among researchers from different fields and with statistical modeling interests, because of their accuracy, which is generally higher than that of individual classifiers (Polikar, 2006). Despite the merits of these methods, it is often a challenge to understand completely the theoretical framework behind them.
The strategy of combining the outputs of different classifiers implies that individual classifiers make errors on different instances. The logic is that, if each classifier makes different errors, then a good combination of these classifiers can reduce the total error, improving the errors of not-so-good classifiers. For this, it is interesting to make each classifier as unique as possible with respect to misclassified instances. In particular, it is necessary to find classifiers whose decision boundaries are adequately different from those of others. Such a set of classifiers is said to be diverse (Polikar, 2006; Brown et al., 2005 and references therein). In general, however, ensemble algorithms do not attempt to maximize a specific diversity measure. Rather, increased diversity is usually sought somewhat heuristically through various resampling procedures, such as the selection (randomly or not) of different training parameters, models, or subsets of features. Ensemble methods can be classified into two categories: homogeneous and non-homogeneous. Homogeneous methods combine
* Corresponding author. E-mail addresses: mbourel@fing.edu.uy (M. Bourel), [email protected] (C. Crisci).
http://dx.doi.org/10.1016/j.ecoinf.2017.09.004
Received 28 May 2017; Received in revised form 6 September 2017; Accepted 9 September 2017; Available online 12 September 2017
1574-9541/ © 2017 Elsevier B.V. All rights reserved.
et al., 2009). Besides ML techniques, more classical techniques such as generalized linear modeling or linear discriminant analysis are usually considered in the consensus construction (Thuiller et al., 2009; Marmion et al., 2009a,b; Lauzeral et al., 2015; Comte and Grenouillet, 2013) since, in some cases (e.g., linear relations between the predictors and the response variable), these methods may outperform ML techniques. It must be noted that, although the consensus approach clearly has a number of attractive characteristics, the understanding of its merits for ecological prediction is still limited (Marmion et al., 2009b); hence, further studies comparing the predictive capacity of consensus methods with that of single methods are needed. It must be noted also that most of the applications of consensus methods in ecological studies are related to the study of species distribution models (SDMs) (Guisan and Thuiller, 2005). In this paper, we explore the performance of six different consensus methods for predicting the presence–absence of eight marine phytoplankton species from the Atlantic coast of Uruguay. Four of the methods are a mixture of experts, and the other two are stacking applications. Moreover, we analyze the performance of the consensus models by considering 10 well-known open access data sets. To generate the consensus, we combined four individual models with very different structures, three of which have been documented as some of the most accurate ML techniques: boosting, RF, and support vector machine (SVM), whereas the fourth is a generalized linear model (GLM) that could better capture the linear relationships in data. For a more detailed description of these models, we refer the reader to the Supplementary material.
classifiers of the same nature; examples of this type of methods are bagging (Breiman, 1996a), random forests (RF) (Breiman, 2001), and boosting (Freund and Schapire, 1997; Schapire and Freund, 1998). In this paper, we will pay attention to non-homogeneous methods, which we will refer to as consensus methods. Consensus methods consist of a combination of various methods of a different nature. Examples of this type of methods are stacking (Wolpert, 1992; Ting and Witten, 1999; Breiman, 1996b) and mixture of experts (Masoudnia and Ebrahimpour, 2014). The different predictors are combined in some way; for instance, in the case of mixture of experts, this is done generally by averaging (with or without weights) or by voting over the models' predictions. In the case of stacking, the outputs of the different classifiers are used to train another classifier, which makes the final decision rule of the method. One way of doing a mixture of experts is inspired, to some extent, by Bayesian voting, and it consists in assigning a weight to each hypothesis (Kuncheva, 2014). A classifier h generally calculates the posterior probability that a given observation belongs to a class. To fix the notation, we can think that h computes a vector (p_0^h(x), p_1^h(x)), where p_0^h(x) and p_1^h(x) are the posterior probabilities that observation x belongs to class 0 or to class 1, respectively. The consensus of different intermediate classifiers h_1, …, h_M generates a classifier F of the form

F(x) = Argmax_{k ∈ {0,1}} ∑_{m=1}^{M} w_{h_m, ℒ} · p_k^{h_m}(x).

This type of combination is called a weighted averaging combining rule. In this paper, we will compare it empirically to other mixture-of-experts rules and to two versions of stacking.
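As a concrete illustration, the weighted-averaging combining rule above can be sketched as follows (a minimal Python sketch with made-up posteriors and weights, not the authors' R implementation):

```python
# Sketch of the weighted-averaging combining rule. Each base classifier
# h_m reports posterior probabilities (p0, p1) for an observation; the
# consensus weights them by w_m and picks the class with the larger
# weighted sum. The numbers below are purely illustrative.

def weighted_average_consensus(posteriors, weights):
    """posteriors: list of (p0, p1) pairs, one per base classifier.
    weights: one non-negative weight per classifier."""
    score0 = sum(w * p0 for (p0, _), w in zip(posteriors, weights))
    score1 = sum(w * p1 for (_, p1), w in zip(posteriors, weights))
    return 1 if score1 > score0 else 0

# Three toy base classifiers; the second carries the largest weight.
posteriors = [(0.7, 0.3), (0.2, 0.8), (0.6, 0.4)]
weights = [0.2, 0.5, 0.3]
print(weighted_average_consensus(posteriors, weights))  # 1 (0.58 > 0.42)
```

Note that a heavily weighted classifier can overturn the majority: two of the three base classifiers above favor class 0, yet the consensus predicts class 1.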
2. Methods

In this section, we present i) the data sets used to evaluate the performance of the models; ii) the principal concepts of supervised classification; iii) a description of the consensus models analyzed in this work; iv) the way in which we calculated the prediction error of the models; and v) the model tuning and optimization, and the software and functions used.

1.2. Consensus methods in ecological studies
Concerning the ecological modeling of species presence–absence, the performance of different statistical techniques can vary significantly from one case study to another, and it is sometimes unclear which model is the most suitable. There are two possible strategies to reduce the models' uncertainty: (1) acquiring an understanding, via extensive model comparisons, of which method will generally provide the best predictive performance and under what conditions (Marmion et al., 2009b), and (2) using consensus methods (i.e., non-homogeneous ensemble methods) (Thuiller, 2004; Thuiller et al., 2005; Araújo and New, 2007; Marmion et al., 2009b). As mentioned earlier, consensus methods overcome the problem of variability in the predictions of different single models since they are based on the combination of those predictions. Hence, a relevant combination of several unbiased (i.e., accurate) model outputs will result in a more accurate prediction. The challenge rests in choosing adequate single models and finding a relevant algorithm to combine them. When dealing with ecological problems, ML techniques seem to be good candidates for single models because of their predictive capacity (Olden and Jackson, 2002). These techniques are frequently and increasingly considered in ecological studies, in particular for modeling species presence–absence or abundance from environmental variables (De’ath and Fabricius, 2000; Guisan et al., 2002; Drake et al., 2006; Cutler et al., 2007; Kampichler et al., 2010; Olden and Jackson, 2002). ML methods have advantages over traditional statistical methods (e.g., linear models and generalized linear models) since they can deal with characteristics typical of ecological data, such as unusual distributions, non-linearity, multiple missing values, complex data interactions, and dependence among observations (Guisan et al., 2002; Cutler et al., 2007; Crisci et al., 2012).
Besides their flexibility, they typically outperform traditional approaches, making them well suited for modeling ecological systems (Olden et al., 2008). In fact, concerning ecological studies, ML methods are routinely considered when performing consensus models (Marmion et al., 2009a,b; Lauzeral et al., 2015; Comte and Grenouillet, 2013; Thuiller
2.1. Data sets

2.1.1. Marine phytoplankton data

The marine phytoplankton data set is part of the harmful algal blooms (HABs) monitoring program, which is conducted by the National Direction of Aquatic Resources of Uruguay. The program has been carried out weekly since 1991 at fixed sites on the Atlantic coast of Uruguay. We decided to consider the 2011–2014 period because data were available for a greater number of phytoplankton species; furthermore, there was more information concerning the predictor variables. For the period considered, 196 observations were available. Surveys were carried out on two exposed sandy beaches with contrasting morphodynamics: Barra del Chuy (33° 45′ S, 53° 27′ W), a dissipative beach with fine to very fine well-sorted sand, a gentle slope, heavy wave action, and a wide surf zone; and Arachania (34° 36′ S, 53° 44′ W), a reflective beach with coarse sediments and a steep slope (Bergamino et al., 2016) (Fig. 1). At each site, water samples were taken from the surf zone with a plastic bucket for chlorophyll a and phytoplankton quantification. Moreover, water temperature and salinity were measured in situ with an ISY ECO300 probe, and wind intensity and direction were estimated visually. Phytoplankton species were identified and counted with an Olympus IM inverted microscope following Utermöhl (1958) at a final magnification of 1000× (Andersen and Throndsen, 2003). Furthermore, the abundance of potential phytoplankton consumers was registered. Because of potential differences in prey preferences, we decided to consider the three following guilds of phytoplankton consumers: i) microcrustaceans, ii) ciliates and tintinnids, and iii) ciliates, tintinnids, and heterotrophic
Fig. 1. Map of the study area showing the two sites, Arachania and Barra del Chuy, where the phytoplankton samples were obtained and environmental variables were registered (Department of Rocha-Uruguay).
Table 1
Information about the open access data sets: denomination, number of observations and variables, and the URLs from which each data set was downloaded.

Denomination            | No. of observations | No. of variables | Source and reference
Blood transfusion       | 748    | 4  | https://archive.ics.uci.edu/ml/datasets/Blood+Transfusion+Service+Center, Lichman (2013)
Credit approval         | 690    | 15 | https://archive.ics.uci.edu/ml/datasets/Credit+Approval, Lichman (2013)
Default                 | 10,000 | 3  | http://www-bcf.usc.edu/~gareth/ISL/data.html, James et al. (2014)
Housing                 | 506    | 13 | https://archive.ics.uci.edu/ml/machine-learning-databases/housing, Lichman (2013)
Liver disorders         | 345    | 6  | http://archive.ics.uci.edu/ml/datasets/Liver+Disorders, Lichman (2013)
MAGIC Gamma Telescope04 | 19,020 | 10 | https://archive.ics.uci.edu/ml/datasets/MAGIC+Gamma+Telescope, Lichman (2013)
Orange juice            | 1070   | 17 | https://cran.r-project.org/web/packages/ISLR/ISLR.pdf, James et al. (2014)
Parkinson's             | 197    | 23 | https://archive.ics.uci.edu/ml/datasets/Parkinsons, Lichman (2013)
QSAR biodegradation     | 1055   | 41 | https://archive.ics.uci.edu/ml/datasets/QSAR+biodegradation, Lichman (2013)
Spam                    | 4601   | 57 | https://archive.ics.uci.edu/ml/datasets/Spambase, Lichman (2013)
dinoflagellates. Because of methodological constraints, microcrustacean abundance was probably underestimated; however, as the same methodology was applied to all samples, we decided to keep these variables in the analysis. Eight phytoplankton species with distinct ecological requirements and environmental impacts were analyzed: Alexandrium fraterculus, Akashiwo sanguinea, Asterionellopsis guyunusae, Cerataulina pelagica, Dinophysis acuminata, Leptocylindrus danicus, Rhizosolenia setigera, and Thalassionema nitzschioides. D. acuminata is a toxic species that frequently causes mollusk bans in Uruguay, whereas A. fraterculus, A. guyunusae, and A. sanguinea provoke discoloration of the water but cause no harm. The other species (all diatoms) are not toxic and do not provoke water discoloration, but they can reach high abundances in late fall and late spring because of nutrient pulses caused by sediment resuspension driven by strong storm events. Eight data sets were considered to evaluate the performance of the models, one for each species. For each set, the response variable was the presence–absence of the species, obtained by categorizing the abundance variable: the species was absent when abundance was zero and present when abundance was greater than zero. The predictor variables were as follows:

• Site: categorical (Barra del Chuy and Arachania)
• Season: categorical (summer, autumn, winter, and spring)
• Temperature: numerical
• Salinity: numerical
• Wind direction: categorical (eight categories of wind direction were considered, plus one category of “no wind” for the cases where the wind intensity was very low or non-existent)
• Microcrustaceans abundance: numerical
• Ciliates and tintinnids abundance: numerical
• Total microheterotrophs abundance (ciliates, tintinnids, and heterotrophic dinoflagellates): numerical

The abundances of microcrustaceans, ciliates and tintinnids, and microheterotrophs were standardized.

2.1.2. Open access data sets

Besides the marine phytoplankton data, we considered 10 open access data sets to test our models. All data sets presented a dichotomous response variable, with the exception of the housing data set; in this case, the response variable (median value of owner-occupied homes in thousands of US dollars) was continuous and, therefore, it was categorized into two classes (< 20: class 0; ≥ 20: class 1) (Table 1).

2.2. Supervised classification

We will work with a data set ℒ = {(x_1, y_1), (x_2, y_2), …, (x_n, y_n)}, where each labeled observation (x_i, y_i) is composed of an input vector of real characteristics x_i = (x_i^(1), x_i^(2), …, x_i^(d)) and of the binary label y_i. This data set will be used to train our models. The goal of the ML methods consists in finding a predictor built from ℒ with the aim of predicting the class of a new observation from the same measured characteristics of the input vectors. The objective of the constructed model is to obtain an estimator that is as close as possible to the Bayes classifier (Devroye et al., 1997). This classifier assigns to an observation
x the class k that maximizes the posterior probability of belonging to that class. For more details on ML theory, we refer the reader to James et al. (2014), Vapnik (1995), Hastie et al. (2001), and Devroye et al. (1997).
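The preprocessing described in Section 2.1 — categorizing abundance into presence–absence and standardizing the numerical predictors — can be sketched as follows (illustrative Python with made-up abundance values, not the authors' code):

```python
# Sketch of the response and predictor preparation: abundance is
# categorized to presence (1) / absence (0), and numerical predictors
# are standardized (zero mean, unit sample standard deviation).
from statistics import mean, stdev

def to_presence_absence(abundances):
    # present whenever abundance is greater than zero
    return [1 if a > 0 else 0 for a in abundances]

def standardize(values):
    mu, sd = mean(values), stdev(values)
    return [(v - mu) / sd for v in values]

abundance = [0.0, 12.5, 0.0, 3.1, 40.0]          # hypothetical counts
presence = to_presence_absence(abundance)         # [0, 1, 0, 1, 1]
z_scores = standardize([12.5, 3.1, 40.0])         # standardized predictor
```

The same categorization logic applies to the housing data set of Section 2.1.2, with the threshold 20 replacing zero.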
2.3. Consensus methods

An ensemble method has two components: several algorithms that generate the individual classifiers and a way of combining their outputs. In this subsection, we present the six consensus methods considered in this work. In all cases, these classifiers are non-homogeneous ensemble methods that combine four single models: GLM, SVM, boosting, and RF (see the Supplementary material for details). To fix the notation, suppose that we work with M fixed different binary classifiers (i.e., the single models); we train each of them on the same learning sample ℒ.

1. Majority Vote (MV). For a given observation, the M classifiers make the class prediction. The consensus consists in choosing the class that receives the most votes.

2. Mean Probability (MeanProb). For a given observation, we obtain two vectors of probabilities of length M: one with the probabilities of belonging to class 0 (v0), and the other with the probabilities of belonging to class 1 (v1), with one entry per single method. We then average the probabilities of v0 on one side and those of v1 on the other across the single models. The predicted class is the one that presents the highest mean probability.

3. Weighted Average AUC (WA-AUC). This classifier is very similar to the WA used in Marmion et al. (2009b). The sample is divided into two parts: one for training the model (two thirds of the original learning sample) and the other (the remaining one third) for computing its accuracy. For a given observation, we consider the weighted mean of the posterior probabilities of the different single models (constructed with the two thirds of the learning sample). The weights are calculated as the area under the ROC curve (AUC) of each method (obtained with the remaining one third of the learning sample). After normalization, the WA-AUC expressions are as follows:

WA-AUC_0(x) = ∑_{m=1}^{M} AUC_m · p_0^{h_m}(x)  and  WA-AUC_1(x) = ∑_{m=1}^{M} AUC_m · p_1^{h_m}(x),

where AUC_m is the area under the ROC curve of the hypothesis h_m and p_k^{h_m}(x) is the posterior probability of the observation x belonging to class k for the hypothesis h_m. The consensus for an observation x is 0 if WA-AUC_0(x) > WA-AUC_1(x) or 1 otherwise. The implementation of this method is a bit different from that adopted by Marmion et al. (2009b) because, in our version, we keep all the models of the consensus, whereas the version of Marmion et al. (2009b) keeps the four best models out of the eight.

4. Stacking with GLM (StackGLM). Stacking (Wolpert, 1992) consists (generally via a cross-validation process) in fitting several base classifiers and using their predictions to compute a new learning sample to train another classifier (called the meta-classifier). As in Ting and Witten (1999), we used the class probabilities rather than the class predictions of the single methods and we considered a logistic regression as the meta-classifier. For each class, we compute

LR_0(x) = ∑_{m=1}^{M} α_m^0 · p_0^{h_m}(x)  and  LR_1(x) = ∑_{m=1}^{M} α_m^1 · p_1^{h_m}(x),

where the coefficients α_m^k are obtained by minimizing a mean least squared optimization problem as in linear regression. The observation x is assigned to class 1 if LR_1(x) > LR_0(x) or to class 0 otherwise.

5. Stacking with Random Forests (StackRF). Inspired by the idea of StackGLM, we apply the same method by considering RF as the meta-classifier instead of the regression considered above. This provides a stacking that is not linear in its final output. This is not the first time that a tree-based method has been used as a meta-classifier (Todorovski and Džeroski, 2000; Džeroski and Ženko, 2004); to our knowledge, however, RF has not yet been used as one.

6. Weighted Average Prediction Error (WA-PE). This method gives a classifier that is a linear combination of weighted single classifiers. As in WA-AUC, the sample is divided into two parts: for each model m, the first part is used for training the model and the second for computing its accuracy w_m. After normalizing the weights w_m of each classifier, we compute

WA-PE_0(x) = ∑_{m=1}^{M} w_m · p_0^{h_m}(x)  and  WA-PE_1(x) = ∑_{m=1}^{M} w_m · p_1^{h_m}(x).

The observation x is assigned to class 1 if WA-PE_1(x) > WA-PE_0(x) or to class 0 otherwise. For further information about the theoretical and practical framework of this method, we refer the reader to Fumera and Roli (2005) or Kuncheva (2014).

2.4. Computation of the generalization error

To obtain an “honest” estimation of a model's error, it is usual, in the context of ML, to consider the generalization error (James et al., 2014). This is a measure of how accurately an algorithm can predict outcome values for “new” data. This is common practice in the ML context since ML techniques tend to overfit the data and, therefore, an error estimation based on the same data that were used to train the model will probably lead to an overestimation of the model's performance. To calculate this error, we first train the model on a learning sample and then test it on a new data set, called the test sample, that has not contributed to the construction of the model. One way to accomplish this is to randomly split the original data set ℒ into two parts: about two thirds of the data, the learning sample ℒ_T, are used to train the model, and the remaining one third, the test sample ℒ_S, is used to measure its performance, i.e., to calculate the generalization error (Fig. 2). Therefore, when we refer to the generalization error of a classifier, we refer to err(f, ℒ_S), which can be expressed as

err(f, ℒ_S) = (1/n) ∑_{i=1}^{n} 1{f(x_i) ≠ y_i},

where 1{f(x_i) ≠ y_i} = 1 if f(x_i) ≠ y_i and 0 otherwise. This measure gives the proportion of misclassified cases over a test sample. Clearly, this error will depend on the split of the data; therefore, to obtain an unbiased estimation, we perform the split several times, and the final performance is the mean performance over all the splits. By doing this, we obtain an empirical value of the expected error of the method. To achieve this, in this study, we performed for each model 50 independent splits of about two thirds to one third of the data. For information on further resampling methods to calculate the generalization error, we refer the reader to James et al. (2014), Vapnik (1995), Hastie et al. (2001), and Devroye et al. (1997).

2.5. Simulations and tuning of the methods
All simulations were performed in R (R Core Team, 2016). We used the function glm of the package MASS for the GLM, the package e1071 for the SVM, and the package randomForest for the RF. As boosting in
performed the best only once (Table 2, Fig. 3b), whereas MeanProb, WA-AUC, StackGLM, GLM, and boosting never presented the lowest generalization error. Among these models, boosting and StackGLM showed the highest generalization errors (mean generalization errors over all species of 29.43% and 29.27%, respectively). Generalization error differences between the best and the worst model were, on average, 5.2% (± 1.7%); the greatest difference was 8.77% for A. sanguinea (best model: StackRF; worst model: GLM), and the smallest was 3.53% for R. setigera (best model: WA-PE; worst model: boosting). For the open access data sets, comparisons between data sets were not relevant since they concern data of very different natures. Within each data set, again, in most cases the consensus methods were the models that performed better (seven cases out of ten). Within these methods, the “winners” were the same as those for the phytoplankton data sets: WA-PE was the model with the greatest number of wins (i.e., five wins), followed by MV and StackRF (each with one win). Among the single models, GLM, boosting, and RF presented the lowest generalization error only once each (Table 3, Fig. 3c). No wins were observed for the MeanProb, WA-AUC, StackGLM, and SVM models (Fig. 3c). The lowest mean generalization error was observed for RF (13.71%), whereas the highest was found for GLM (23.49%). The poor performance of GLM for most data sets was remarkable: it presented approximately two- to six-fold higher errors than the other models and was the worst model in 8 of 10 cases (Table 3). The generalization error differences between the best and the worst model were, on average, 11.04% (± 6.9%); the greatest difference was 26.9% for Spam (best model: StackRF; worst model: GLM), and the smallest was 4.6% for the Orange juice data set (best and worst models: GLM and StackGLM, respectively).
These results indicate great differences in the models' performance within and between the data sets. Despite the aforementioned differences in the models' behavior, it must be noted that, in many cases, the differences in mean generalization errors between the models were not very important (Fig. 4a and b).
Fig. 2. Computation of the generalization error. The data are randomly split into two samples: the first one, called the training sample or learning sample, is used to construct the model, and the second one, called the test sample, is used to compute the generalization error. To avoid the bias caused by the random split, it is necessary to run this procedure several times and the final error is the average of all the errors obtained over the different test samples.
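The repeated-split procedure of Fig. 2 can be sketched as follows (a Python sketch in which a trivial majority-class predictor stands in for the real classifiers; function names and data are hypothetical):

```python
# Sketch of the repeated-split estimation of the generalization error:
# train on ~2/3 of the data, test on the remaining 1/3, and average the
# test error over many random splits.
import random

def split_error(xs, ys, fit, n_splits=50, train_frac=2/3, seed=0):
    rng = random.Random(seed)
    errors = []
    for _ in range(n_splits):
        idx = list(range(len(ys)))
        rng.shuffle(idx)
        cut = int(len(idx) * train_frac)
        train, test = idx[:cut], idx[cut:]
        predict = fit([xs[i] for i in train], [ys[i] for i in train])
        wrong = sum(predict(xs[i]) != ys[i] for i in test)
        errors.append(wrong / len(test))
    return sum(errors) / len(errors)  # mean error over all splits

def majority_fit(xs, ys):
    # stand-in "classifier": always predicts the majority training class
    majority = 1 if sum(ys) * 2 >= len(ys) else 0
    return lambda x: majority

xs = list(range(100))
ys = [1] * 70 + [0] * 30
err = split_error(xs, ys, majority_fit)  # close to 0.3 on average
```

Any real learner (GLM, SVM, boosting, RF, or a consensus of them) can be plugged in through the `fit` argument, which must return a prediction function.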
its best-known version, AdaBoost, uses trees at each step, we constructed these trees with CART (package tree). This procedure is equivalent to the original AdaBoost algorithm (Breiman, 1998). At each iteration, the size of the corresponding tree is optimized over a test sample (a bootstrap sample drawn at the same iteration using the current weights). To obtain the optimal GLM model, we applied the AIC criterion via the stepAIC function. For some of the phytoplankton species, we analyzed variable importance with the tool provided by RF (varImpPlot function of the randomForest package). For SVM, we used a radial kernel and optimized the values of the cost-complexity parameter and γ with the function tune.svm. Finally, for RF and AdaBoost, we used 300 intermediate trees. For both stacking versions, we omitted the cross-validation procedure used to compute the predictions of the base classifiers in Ting and Witten (1999), since the results obtained were similar to those obtained with cross-validation; furthermore, the computation time without cross-validation was considerably shorter. All the methods were trained and tested as explained in Section 2.4.
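The stacking scheme used by StackGLM and StackRF (items 4 and 5 of Section 2.3) can be sketched as follows (Python; the toy base models and the averaging meta-classifier below are stand-ins, not the GLM/SVM/boosting/RF and meta-learners actually used):

```python
# Sketch of stacking: base classifiers are fitted on the learning
# sample, their class-1 probabilities form a new feature matrix, and a
# meta-classifier is trained on those features.

def make_stacked_features(base_models, xs):
    # one row per observation: the class-1 probability of each base model
    return [[m(x) for m in base_models] for x in xs]

def fit_meta_average(features, ys):
    # stand-in meta-classifier: averages the base probabilities and
    # thresholds at 0.5 (a real StackGLM would fit regression weights on
    # `features` and `ys`; a StackRF would grow a forest on them)
    def predict(row):
        return 1 if sum(row) / len(row) > 0.5 else 0
    return predict

# two toy "base classifiers", each returning a class-1 probability
base_models = [lambda x: 0.9 if x > 0 else 0.1,
               lambda x: 0.8 if x > -1 else 0.3]
feats = make_stacked_features(base_models, [2.0, -5.0])
meta = fit_meta_average(feats, [1, 0])
preds = [meta(row) for row in feats]  # [1, 0]
```

Swapping `fit_meta_average` for a tree-ensemble learner reproduces the StackRF idea of a meta-classifier that is not linear in its final output.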
3.2. Marine phytoplankton presence–absence prediction and the importance of environmental variables

The generalization error when predicting the presence–absence of marine phytoplankton species varied considerably among species. As mentioned earlier, the errors were between 10.45% and 39.55% (mean of all models within species), indicating that the analyzed predictors and/or the selected models were more adequate in some cases than in others. In this subsection, we present the results for the species with the best performance (i.e., the lowest generalization errors). Besides presenting the models' errors, we show the variable importance results according to RF. Three species presented low or moderate generalization errors based on the result of the best-performing model: A. sanguinea (StackRF generalization error = 8.8 ± 3.4%), A. fraterculus (RF generalization error = 15.85 ± 3.9%), and A. guyunusae (SVM generalization error = 20.77 ± 3.9%). Regarding the RF variable importance plots, the results for A. sanguinea and A. fraterculus indicate the importance of salinity, temperature, season, and the variables concerning predator abundance (Fig. 5a and b). For A. guyunusae, site was the most important variable, although other variables such as salinity, temperature, and ciliates and tintinnids abundance were also important (Fig. 5c).
3. Results

3.1. Models' performance

With all the data sets considered together, the WA-PE consensus method was the model that presented the lowest generalization error in most cases (9 of 18) (Fig. 3, Tables 2 and 3). MV and StackRF among the consensus methods, and RF among the single methods, were next to WA-PE in number of wins (two wins each) (Fig. 3a). Finally, the remaining methods presented the lowest generalization error only once (GLM, boosting, and SVM) or never (MeanProb, WA-AUC, and StackGLM) (Fig. 3a). The phytoplankton data sets presented variable error rates among the species (the average generalization error of all models was between 10.45% and 39.55%; Table 2) (see also Section 3.2). Among the models (within species), most of the time a consensus method was the model that performed better. WA-PE was the model with the greatest number of wins (i.e., five wins over eight species; Fig. 3b) and also the lowest mean error (25.98%; Fig. 4a). StackRF, MV, RF, and SVM
4. Discussion

4.1. Consensus models' performance

In this study, we applied six different consensus methods to predict the presence–absence of marine phytoplankton species. Furthermore,
Fig. 3. Number of wins of the different methods, considering the 18 data sets together (a), the marine phytoplankton data sets (b), and the open access data sets (c).
Concerning the single models, even though boosting and SVM generally presented good performances, RF was the one that performed the best; this is not surprising considering the high accuracy reported for this model in many examples (Breiman, 2001). On the contrary, GLM performed more poorly. For example, for the phytoplankton data sets, it was not always the worst model, but in no case was it the best. For the open access data sets, it performed the best only once; for the remaining data sets, however, it generally presented the worst behavior. It is true that the difference in magnitude of the classification error between the best model and the next best-performing models was not very important; however, considerable differences were found between the best and the worst models (the greatest difference in errors was 8.77% for the phytoplankton species data sets and 26.9% for the open data sets). Considering that, in general, a consensus model was the best model, and that when the consensus models did not outperform the single models they took second or third place, it appears that, when in doubt as to which single model to use (e.g., when one is not convinced that a single model is closest to the truth in all circumstances; Araújo and New, 2007), it would be advisable to consider the consensus models. Further work considering a larger number of data sets would be desirable to better assess the differences in performance between the models considered in this work. An interesting topic regarding model comparisons is the choice of the measure of model performance. In this context, the area under the receiver operating characteristic (ROC) curve (AUC) has been broadly used in ecological studies, mainly when modeling species distributions (Brotons et al., 2004; Moisen et al., 2006; Elith et al., 2006; Marmion et al., 2009b). However, this criterion has been importantly criticized (Lobo et al.,
we evaluated the performance of the models using open access data sets. To construct the consensus, we decided to combine three ML techniques that are well established in the ML community, generally present very good performance, and, at the same time, are broadly used in ecological studies (e.g., Cutler et al., 2007; De'ath, 2007; Guo et al., 2005). It must be noted that two of these techniques (i.e., boosting and RF) are themselves ensemble methods. The fourth method (i.e., GLM) is a linear model that could perform better where there are linear relations between the response and the predictors. One of the principal issues to address when constructing a consensus method is to make each classifier as unique as possible, particularly with respect to the instances it misclassifies (Polikar, 2006). One way to accomplish this is by aggregating very different types of classifiers (Polikar, 2006), and that is what we intended to do in this work. From an ecological perspective, obtaining a consensus from an ensemble of classifiers can be interpreted as considering at once several copies of a system, each of which represents a possible state that the real system might be in at some specific time (Araújo and New, 2007). Consequently, taking into account base models with different structures will be more representative of different system states. We must also note the limitations of building a consensus from an extremely large number of models, each of which has to be optimized properly, a task that is not always straightforward. The consensus methods presented in this study are based on a limited number of well-known, easily optimized models with very different characteristics. Our results indicate that, in general, the consensus methods performed better than the single models, although there were exceptions.
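As a concrete illustration of the two simplest combination rules (majority voting, VM, and averaging of predicted probabilities, MeanProb), the following is a minimal sketch in Python (the study's analyses were run in R); the probabilities are toy numbers standing in for the outputs of the four fitted classifiers, and the tie-breaking rule is an assumption:

```python
# Toy predicted probabilities of presence from four hypothetical base
# classifiers (GLM, RF, boosting, SVM) for five test sites.
probs = {
    "GLM":      [0.62, 0.40, 0.55, 0.20, 0.71],
    "RF":       [0.70, 0.35, 0.48, 0.30, 0.80],
    "Boosting": [0.58, 0.45, 0.60, 0.25, 0.66],
    "SVM":      [0.66, 0.30, 0.52, 0.40, 0.75],
}

def vote_majority(probs, threshold=0.5):
    """VM rule: each classifier casts a presence/absence vote at each site,
    and the consensus predicts the majority class (ties -> absence, an
    assumption of this sketch)."""
    n = len(next(iter(probs.values())))
    preds = []
    for i in range(n):
        votes = sum(p[i] > threshold for p in probs.values())
        preds.append(1 if votes > len(probs) / 2 else 0)
    return preds

def mean_prob(probs, threshold=0.5):
    """MeanProb rule: average the predicted probabilities across the
    classifiers, then threshold the mean."""
    n = len(next(iter(probs.values())))
    means = [sum(p[i] for p in probs.values()) / len(probs) for i in range(n)]
    return [1 if m > threshold else 0 for m in means]

print(vote_majority(probs))  # [1, 0, 1, 0, 1]
print(mean_prob(probs))      # [1, 0, 1, 0, 1]
```

On this toy example both rules agree, but they can differ when a single confident classifier pulls the mean probability across the threshold while remaining outvoted.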
Table 2. Mean generalization error (± SD, in %) over 50 test samples on the eight marine phytoplankton presence–absence data sets. For each data set, the lowest error is shown in bold.
| Method | A. fraterculus | A. sanguinea | A. guyunusae | C. pelagica | D. acuminata | L. danicus | R. setigera | T. nitzschioides |
|---|---|---|---|---|---|---|---|---|
| GLM | 19.51 ± 5.4 | 17.57 ± 10.1 | 21.75 ± 4.6 | 35.72 ± 4.7 | 41.23 ± 5.7 | 26.49 ± 4.9 | 31.17 ± 5.2 | 34.74 ± 5.7 |
| RF | **15.85 ± 3.9** | 9.32 ± 3.6 | 22.86 ± 3.9 | 31.41 ± 5.2 | 38.34 ± 4.8 | 27.35 ± 5.4 | 30.15 ± 3.8 | 34.77 ± 5.9 |
| Boosting | 20.31 ± 4.2 | 9.38 ± 3.6 | 26.71 ± 4.2 | 34.09 ± 5.5 | 41.88 ± 5.8 | 30.06 ± 5.5 | 33.14 ± 4.4 | 39.85 ± 4.8 |
| SVM | 16.15 ± 3.7 | 10.86 ± 3.4 | 21.94 ± 4.4 | 36.37 ± 4.6 | 39.69 ± 5.6 | **25.97 ± 4.4** | 30.40 ± 5.4 | 36.06 ± 5.1 |
| VM | 16.65 ± 4.0 | 9.54 ± 3.7 | **20.77 ± 3.9** | 32.65 ± 6.0 | 38.92 ± 4.5 | 26.83 ± 5.4 | 29.79 ± 4.8 | 34.28 ± 6.1 |
| MeanProb | 17.32 ± 4.6 | 10.37 ± 3.5 | 20.86 ± 4.3 | 31.51 ± 5.4 | 38.12 ± 4.4 | 27.41 ± 5.8 | 30.00 ± 5.0 | 34.22 ± 6.1 |
| WA-AUC | 16.43 ± 4.2 | 9.94 ± 3.7 | 21.08 ± 4.4 | 31.45 ± 5.4 | 38.12 ± 4.5 | 27.32 ± 5.5 | 30.59 ± 5.2 | 34.09 ± 6.2 |
| StackGLM | 19.88 ± 4.0 | 8.95 ± 3.6 | 26.71 ± 4.4 | 33.94 ± 5.4 | 41.81 ± 5.7 | 30.12 ± 5.6 | 32.89 ± 4.6 | 39.91 ± 4.8 |
| StackRF | 18.21 ± 3.9 | **8.80 ± 3.4** | 24.37 ± 4.2 | 32.89 ± 5.6 | 40.86 ± 5.3 | 29.11 ± 5.7 | 31.82 ± 4.6 | 37.41 ± 4.9 |
| WA-PE | 16.02 ± 4.3 | 10.31 ± 3.4 | 21.08 ± 4.4 | **31.35 ± 5.3** | **38.00 ± 4.5** | 27.48 ± 5.5 | **29.61 ± 5.2** | **34.01 ± 6.1** |
Table 3. Mean generalization error (± SD, in %) over 50 test samples on the 10 open access data sets. For each data set, the lowest error is shown in bold.
| Method | Blood transfusion | Credit approval | Default | Housing | Liver disorders | MAGIC | Orange juice | Parkinsons | QSAR biodegradation | Spam |
|---|---|---|---|---|---|---|---|---|---|---|
| GLM | 22.74 ± 2.6 | 22.78 ± 18.3 | 16.86 ± 2.6 | 17.21 ± 3.8 | 33.65 ± 4.3 | 18.18 ± 0.4 | **17.26 ± 1.8** | 22.74 ± 5.6 | 30.74 ± 5.9 | 31.96 ± 1.9 |
| RF | 24.44 ± 2.1 | 13.38 ± 3.2 | 2.89 ± 0.2 | 13.32 ± 2.4 | 29.63 ± 4.2 | **11.77 ± 0.3** | 19.68 ± 2.0 | 12.49 ± 4.6 | 14.94 ± 2.0 | 5.35 ± 0.5 |
| Boosting | 32.48 ± 3.4 | 14.83 ± 3.4 | 4.87 ± 0.5 | 12.46 ± 2.4 | 31.61 ± 4.4 | 17.59 ± 0.6 | 21.89 ± 2.0 | **10.65 ± 3.8** | 15.05 ± 1.9 | 5.32 ± 0.7 |
| SVM | 22.92 ± 2.6 | 13.61 ± 3.2 | 2.76 ± 0.2 | 12.81 ± 2.0 | 33.40 ± 5.2 | 14.39 ± 0.4 | 17.74 ± 1.7 | 14.28 ± 4.3 | 16.99 ± 3.1 | 7.29 ± 0.6 |
| VM | 24.26 ± 2.1 | 16.72 ± 8.7 | **2.75 ± 0.2** | 12.44 ± 2.2 | 29.75 ± 4.7 | 12.79 ± 0.4 | 18.40 ± 1.8 | 13.23 ± 4.5 | 14.76 ± 1.8 | 5.56 ± 0.6 |
| MeanProb | 22.34 ± 2.5 | 13.32 ± 3.2 | 3.19 ± 0.2 | 12.25 ± 1.9 | 28.77 ± 4.0 | 15.89 ± 0.4 | 18.99 ± 2.1 | 14.00 ± 4.0 | 14.35 ± 1.9 | 5.63 ± 0.6 |
| WA-AUC | 22.28 ± 2.4 | 13.27 ± 3.2 | 3.18 ± 0.3 | **12.23 ± 1.9** | 28.67 ± 3.9 | 15.45 ± 0.4 | 18.99 ± 2.1 | 13.29 ± 4.2 | 14.36 ± 1.9 | 5.56 ± 0.6 |
| StackGLM | 27.95 ± 2.7 | 14.81 ± 3.4 | 3.65 ± 0.4 | 12.58 ± 2.4 | 31.56 ± 4.4 | 12.08 ± 0.4 | 21.62 ± 2.2 | 18.74 ± 8.7 | 16.08 ± 7.4 | 5.16 ± 0.7 |
| StackRF | 26.27 ± 2.5 | 14.30 ± 3.4 | 3.73 ± 0.4 | 12.51 ± 2.3 | 30.67 ± 4.6 | 11.81 ± 0.3 | 21.18 ± 1.9 | 11.66 ± 4.1 | 14.76 ± 1.7 | **5.08 ± 0.6** |
| WA-PE | **22.20 ± 2.6** | **13.25 ± 3.2** | 3.29 ± 0.3 | 12.24 ± 1.9 | **28.65 ± 3.8** | 15.66 ± 0.4 | 18.99 ± 2.0 | 13.72 ± 3.9 | **14.34 ± 1.9** | 5.55 ± 0.6 |
4.2. Consensus models for predicting marine phytoplankton species
2008; Manel et al., 2001), with some of the arguments indicating that (i) it ignores the predicted probability values and the goodness of fit of the model; (ii) it summarizes test performance over regions of the ROC space in which one would rarely operate; and (iii) it weights omission and commission errors equally (Lobo et al., 2008). The generalization error used in our work represents a very simple and interpretable measure of model performance. Although it assesses model accuracy using a single threshold, this threshold could be optimized on a case-by-case basis (in particular for imbalanced data) using relatively simple methodologies (Kuhn and Johnson, 2013).
The WA-PE and StackRF consensus methods were among the methods that presented the best performances. In particular, WA-PE was the model with the lowest generalization error in most cases, considering both types of data sets. Moreover, for the phytoplankton data sets, this model was the one with the lowest mean generalization error. Although consensus methods that weight different models according to their performance have been reported previously (Marmion et al., 2009b; Alexandre et al., 2000; Kuncheva, 2014), the particularity of WA-PE is that the weight of each classifier is directly related to its global performance on a test sample. In the light of our results, StackRF appears to be a very promising technique. This technique is presented here for the first time; hence, further research is needed to better understand its behavior.
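A weighted-average rule of this kind can be sketched as follows. This is a minimal illustration, not the authors' implementation: the exact WA-PE weighting formula is not reproduced here, so the choice of weights proportional to test-sample accuracy (1 − error) is an assumption, and all numbers are toy values.

```python
# Hypothetical generalization errors of four base classifiers, estimated
# on a held-out test sample (illustrative numbers only).
errors = {"GLM": 0.22, "RF": 0.13, "Boosting": 0.15, "SVM": 0.14}

# Assumed weighting: weight each model by its test accuracy (1 - error),
# normalized so the weights sum to one. Better-performing models thus
# contribute more to the consensus.
acc = {m: 1.0 - e for m, e in errors.items()}
total = sum(acc.values())
weights = {m: a / total for m, a in acc.items()}

def weighted_average_consensus(probs_by_model, weights, threshold=0.5):
    """Combine predicted presence probabilities with performance-based
    weights, then threshold the weighted mean at 0.5."""
    models = list(probs_by_model)
    n = len(probs_by_model[models[0]])
    combined = [sum(weights[m] * probs_by_model[m][i] for m in models)
                for i in range(n)]
    return [1 if c > threshold else 0 for c in combined]

# Toy probabilities for two test sites.
probs = {"GLM": [0.62, 0.40], "RF": [0.70, 0.35],
         "Boosting": [0.58, 0.45], "SVM": [0.66, 0.30]}
print(weighted_average_consensus(probs, weights))  # [1, 0]
```

Because GLM has the largest error here, its (over- or under-confident) probabilities are down-weighted relative to RF's, which is the intuition behind weighting by test-sample performance.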
Harmful algal bloom (HAB) problems are growing worldwide, and the need to understand these phenomena is more pressing than ever. Despite rapidly expanding observational capabilities owing to technological advances, HAB processes continue to be undersampled, and the development of statistical models that perform well on the available data is therefore of major interest (Moore et al., 2008; McGillicuddy Jr., 2010). In particular, predicting species that are toxic, or that are nontoxic but provoke important water discoloration (which can cause alarm in the population), is an important issue for water managers. In this study, the models' accuracy for some of these species was very satisfactory (some models presented errors of nearly 9%, 15%, and 20%), especially considering that variables that are important in phytoplankton ecology, and therefore potentially strong predictors, such as nutrients, were not included. Previous studies predicting the presence–absence or abundance of marine phytoplankton species are mainly restricted to generalized linear models (GLMs) (Richardson et al., 2003; Lane et al., 2009; Anderson et al., 2010). ML techniques have been used more rarely, with some exceptions mainly concerning neural networks and SVMs (Scardi and Harding, 1999; Lee et al., 2003; Vilas et al., 2014), and such studies have mostly addressed freshwater ecosystems (Recknagel et al., 1998; Wilson and Recknagel, 2001; Jeong et al., 2001; Kruk and Segura, 2012). To our knowledge, this is the first attempt at applying consensus methods to predict the presence–absence of marine
Fig. 4. Boxplots of the generalization errors of the different methods on (a) the phytoplankton presence–absence data sets and (b) the open access data sets. In both cases, the two rightmost boxes correspond to the consensus method with the lowest median error (Min_cons) and to the method, among all methods, with the lowest median error (Min).
Fig. 5. Variable importance plots of random forests for (a) A. fraterculus, (b) A. sanguinea, and (c) A. guyunusae.
phytoplankton species. Considering that each single method of the consensus could represent a possible state in which a real system might be at some specific time (Araújo and New, 2007), and given the complexity of phytoplankton dynamics (Medvinsky et al., 2002), these methods could be more suitable than single models for predictive purposes. In fact, our results indicated that, in most cases, the consensus methods performed better than the single methods (although, as mentioned earlier, the differences in the magnitude of the errors were not very large). Further research considering additional species, predictor variables, and samples is needed to better understand the potential of these methods.
Regarding the RF variable importance plots, the results for A. sanguinea and A. fraterculus indicate the importance of salinity, temperature, season, and variables concerning predator abundance (Fig. 5a, b). These are two dinoflagellate species that can sustain their position in the water column and are thus favored by stable stratification (Smayda, 2002). On the Uruguayan coast, the freshwater discharge of the Río de la Plata (the second largest basin in South America) enhances the stratification of the water column, favoring this stability (Ortega and Martínez, 2007). Regarding the season variable, autumn is associated with the start of the river discharge as well as with the highest temperatures of the year (Ortega and Martínez, 2007). All these features favor increased abundance of A. sanguinea and A. fraterculus. It is not surprising that variables concerning predators are also among the most important, since the grazing pressure of the considered predators plays a significant role in controlling phytoplankton abundance (Jeong, 1999). In fact, after a bloom of either of these species, an important increase in the abundance of predators has been observed (Martínez, unpublished data). For A. fraterculus, the abundances of microheterotrophs, ciliates, and tintinnids may be more important than that of microcrustaceans, since they are far more abundant (Jeong, 1999; Jeong et al., 2013; Jeong et al., 2015; Kim et al., 2016). A. guyunusae is a diatom typical of the surf zone of dissipative beaches (Campbell, 1996; Odebrecht et al., 2014). Hence, the variable site was expected to be the most relevant, although salinity, temperature, and predator abundance are also important variables in this species' dynamics (Odebrecht et al., 2014). Wind intensity was expected to be important too, because A. guyunusae has resistant cells that persist in the sediment during calm weather and are resuspended during storms (Talbot et al., 1990; Odebrecht et al., 2014); in the study region, calm weather is associated with northerly winds, whereas storms occur with southerly winds. The low importance observed for this variable is likely related to the way in which it was measured (categories assigned via visual observation).
5. Conclusions
Consensus methods present an interesting alternative for developing predictive tools on which to build sound monitoring and management programs. They have been shown to produce favorable results compared with single methods (Polikar, 2006), although further applications in ecology are needed to determine the full potential of these methods. In particular, further knowledge is needed in the context of marine phytoplankton, and especially for species that represent challenges for water managers and decision makers. There is a growing consensus that the weighted average could be a very appropriate approach for combining models because of its consistent performance over a broad spectrum of applications (Polikar, 2006; Fumera and Roli, 2005). Ideally, accurate models with relatively simple structures should be provided to ecologists interested in predictive modeling. In this work, we presented relatively simple consensus models constructed from a few well-known, accurate models, combined in simple but relevant ways. In several cases, these models behaved better than the single methods, with the best-performing consensus model being the one that used the weighted average as the combination rule. The observed differences, although small, show that attention must be given to this type of method when dealing with ecological prediction.
Acknowledgments
This work was supported by ECOS-Sud Aprendizaje Automático para la Modelización y el Análisis de Recursos Naturales (project no. U14E02) and by ANII-Uruguay.
References
Alexandre, L.A., Campilho, A.C., Kamel, M.S., 2000. Combining independent and unbiased classifiers using weighted average. In: 15th International Conference on Pattern Recognition (ICPR'00), Barcelona, Spain, September 3–8, 2000, pp. 2495–2498.
Andersen, P., Throndsen, J., 2003. Estimating cell numbers. In: Manual on Harmful Marine Microalgae. UNESCO Publishing, Paris, pp. 99–129.
Anderson, C.R., Sapiano, M.R., Prasad, M.B.K., Long, W., Tango, P.J., Brown, C.W., Murtugudde, R., 2010. Predicting potentially toxigenic Pseudo-nitzschia blooms in the Chesapeake Bay. J. Mar. Syst. 83, 127–140.
Araújo, M.B., New, M., 2007. Ensemble forecasting of species distributions. Trends Ecol. Evol. 22, 42–47.
Bergamino, L., Martínez, A., Han, E., Lercari, D., Defeo, O., 2016. Trophic niche shifts driven by phytoplankton in sandy beach ecosystems. Estuar. Coast. Shelf Sci. 180, 33–40.
Bourel, M., 2012. Model aggregation methods and applications. Mem. Trab. Difusión Cient. Téc. 10, 19–32.
Bourel, M., 2013. Apprentissage statistique par agrégation de modèles. Ph.D. Thesis, Université Aix-Marseille, France.
Breiman, L., 1996a. Bagging predictors. Mach. Learn. 24, 123–140.
Breiman, L., 1996b. Stacked regressions. Mach. Learn. 24, 49–64.
Breiman, L., 1998. Arcing classifiers. Ann. Stat. 26, 801–849.
Breiman, L., 2001. Random forests. Mach. Learn. 45, 5–32.
Brotons, L., Thuiller, W., Araújo, M.B., Hirzel, A.H., 2004. Presence–absence versus presence-only modelling methods for predicting bird habitat suitability. Ecography 27, 437–448.
Brown, G., Wyatt, J., Harris, R., Yao, X., 2005. Diversity creation methods: a survey and categorisation. Inf. Fusion 6, 5–20.
Campbell, E.E., 1996. The global distribution of surf diatom accumulations. Rev. Chil. Hist. Nat. 69, 495–501.
Comte, L., Grenouillet, G., 2013. Species distribution modelling and imperfect detection: comparing occupancy versus consensus methods. Divers. Distrib. 19, 996–1007.
Crisci, C., Ghattas, B., Perera, G., 2012. A review of supervised machine learning algorithms and their applications to ecological data. Ecol. Model. 240, 113–122.
Cutler, D.R., Edwards, T.C., Beard, K.H., Cutler, A., Hess, K.T., Gibson, J., Lawler, J.J., 2007. Random forests for classification in ecology. Ecology 88, 2783–2792.
De'ath, G., 2007. Boosted trees for ecological modeling and prediction. Ecology 88, 243–251.
De'ath, G., Fabricius, K.E., 2000. Classification and regression trees: a powerful yet simple technique for ecological data analysis. Ecology 81, 3178–3192.
Devroye, L., Györfi, L., Lugosi, G., 1997. A Probabilistic Theory of Pattern Recognition. Applications of Mathematics, vol. 31. Springer, corrected 2nd printing.
Drake, J.M., Randin, C., Guisan, A., 2006. Modelling ecological niches with support vector machines. J. Appl. Ecol. 43, 424–432.
Džeroski, S., Ženko, B., 2004. Is combining classifiers with stacking better than selecting the best one? Mach. Learn. 54, 255–273.
Elith, J., Graham, C.H., Anderson, R.P., Dudík, M., Ferrier, S., Guisan, A., Hijmans, R.J., Huettmann, F., Leathwick, J.R., Lehmann, A., Li, J., Lohmann, L.G., Loiselle, B.A., Manion, G., Moritz, C., Nakamura, M., Nakazawa, Y., Overton, J.McC.M., Townsend Peterson, A., Phillips, S.J., Richardson, K., Scachetti-Pereira, R., Schapire, R.E., Soberón, J., Williams, S., Wisz, M.S., Zimmermann, N.E., 2006. Novel methods improve prediction of species distributions from occurrence data. Ecography 29, 129–151.
Freund, Y., Schapire, R., 1997. A decision-theoretic generalization of on-line learning and an application to boosting. J. Comput. Syst. Sci. 55, 119–139.
Fumera, G., Roli, F., 2005. A theoretical and experimental analysis of linear combiners for multiple classifier systems. IEEE Trans. Pattern Anal. Mach. Intell. 27, 942–956.
Guisan, A., Edwards, T., Hastie, T., 2002. Generalized linear and generalized additive models in studies of species distributions: setting the scene. Ecol. Model. 157, 89–100.
Guisan, A., Thuiller, W., 2005. Predicting species distribution: offering more than simple habitat models. Ecol. Lett. 8, 993–1009.
Guo, Q., Kelly, M., Graham, C.H., 2005. Support vector machines for predicting distribution of sudden oak death in California. Ecol. Model. 182, 75–90.
Hastie, T., Tibshirani, R., Friedman, J., 2001. The Elements of Statistical Learning. Springer Series in Statistics. Springer, New York.
James, G., Witten, D., Hastie, T., Tibshirani, R., 2014. An Introduction to Statistical Learning: With Applications in R. Springer.
Jeong, H.J., 1999. The ecological roles of heterotrophic dinoflagellates in marine planktonic community. J. Eukaryot. Microbiol. 46, 390–396.
Jeong, H.J., Du Yoo, Y., Lee, K.H., Kim, T.H., Seong, K.A., Kang, N.S., Lee, S.Y., Kim, J.S., Kim, S., Yih, W.H., 2013. Red tides in Masan Bay, Korea in 2004–2005: I. Daily variations in the abundance of red-tide organisms and environmental factors. Harmful Algae 30, S75–S88.
Jeong, H.J., Lim, A.S., Franks, P.J., Lee, K.H., Kim, J.H., Kang, N.S., Lee, M.J., Jang, S.H., Lee, S.Y., Yoon, E.Y., et al., 2015. A hierarchy of conceptual models of red-tide generation: nutrition, behavior, and biological interactions. Harmful Algae 47, 97–115.
Jeong, K.-S., Joo, G.-J., Kim, H.-W., Ha, K., Recknagel, F., 2001. Prediction and elucidation of phytoplankton dynamics in the Nakdong River (Korea) by means of a recurrent artificial neural network. Ecol. Model. 146, 115–129.
Kampichler, C., Wieland, R., Calmé, S., Weissenberger, H., Arriaga-Weiss, S., 2010. Classification in conservation biology: a comparison of five machine-learning methods. Ecol. Inform. 5, 441–450.
Kim, J.H., Jeong, H.J., Lim, A.S., Rho, J.R., Lee, S.B., 2016. Killing potential protist predators as a survival strategy of the newly described dinoflagellate Alexandrium pohangense. Harmful Algae 55, 41–55.
Kruk, C., Segura, A.M., 2012. The habitat template of phytoplankton morphology-based functional groups. Hydrobiologia 698, 191–202.
Kuhn, M., Johnson, K., 2013. Applied Predictive Modeling. Springer.
Kuncheva, L.I., 2014. Combining Pattern Classifiers: Methods and Algorithms, 2nd ed. Wiley.
Lane, J.Q., Raimondi, P.T., Kudela, R.M., 2009. Development of a logistic regression model for the prediction of toxigenic Pseudo-nitzschia blooms in Monterey Bay, California. Mar. Ecol. Prog. Ser. 383, 37–51.
Lauzeral, C., Grenouillet, G., Brosse, S., 2015. The iterative ensemble modelling approach increases the accuracy of fish distribution models. Ecography 38, 213–220.
Lee, J.H., Huang, Y., Dickman, M., Jayawardena, A.W., 2003. Neural network modelling of coastal algal blooms. Ecol. Model. 159, 179–201.
Lichman, M., 2013. UCI Machine Learning Repository.
Lobo, J.M., Jiménez-Valverde, A., Real, R., 2008. AUC: a misleading measure of the performance of predictive distribution models. Glob. Ecol. Biogeogr. 17, 145–151.
Manel, S., Williams, H.C., Ormerod, S.J., 2001. Evaluating presence–absence models in ecology: the need to account for prevalence. J. Appl. Ecol. 38, 921–931.
Marmion, M., Hjort, J., Thuiller, W., Luoto, M., 2009a. Statistical consensus methods for improving predictive geomorphology maps. Comput. Geosci. 35, 615–625.
Marmion, M., Parviainen, M., Luoto, M., Heikkinen, R., Thuiller, W., 2009b. Evaluation of consensus methods in predictive species distribution modelling. Divers. Distrib. 15, 59–69.
Masoudnia, S., Ebrahimpour, R., 2014. Mixture of experts: a literature survey. Artif. Intell. Rev. 42, 275–293.
McGillicuddy Jr., D., 2010. Models of harmful algal blooms: conceptual, empirical, and numerical approaches. J. Mar. Syst. 83, 105–107.
Medvinsky, A.B., Petrovskii, S.V., Tikhonova, I.A., Malchow, H., Li, B.-L., 2002. Spatiotemporal complexity of plankton and fish dynamics. SIAM Rev. 44, 311–370.
Moisen, G.G., Freeman, E.A., Blackard, J.A., Frescino, T.S., Zimmermann, N.E., Edwards, T.C., 2006. Predicting tree species presence and basal area in Utah: a comparison of stochastic gradient boosting, generalized additive models, and tree-based methods. Ecol. Model. 199, 176–187.
Moore, S.K., Trainer, V.L., Mantua, N.J., Parker, M.S., Laws, E.A., Backer, L.C., Fleming, L.E., 2008. Impacts of climate variability and future climate change on harmful algal blooms and human health. Environ. Health 7, S4.
Odebrecht, C., Du Preez, D.R., Abreu, P.C., Campbell, E.E., 2014. Surf zone diatoms: a review of the drivers, patterns and role in sandy beaches food chains. Estuar. Coast. Shelf Sci. 150, 24–35.
Olden, J.D., Jackson, D.A., 2002. A comparison of statistical approaches for modelling fish species distributions. Freshw. Biol. 47, 1976–1995.
Olden, J.D., Lawler, J.J., Poff, N.L., 2008. Machine learning methods without tears: a primer for ecologists. Q. Rev. Biol. 83, 171–193.
Ortega, L., Martínez, A., 2007. Multiannual and seasonal variability of water masses and fronts over the Uruguayan Shelf. J. Coast. Res. 23, 618–629.
Polikar, R., 2006. Ensemble based systems in decision making. IEEE Circuits Syst. Mag. 6, 21–45.
R Core Team, 2016. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.
Recknagel, F., Fukushima, T., Hanazato, T., Takamura, N., Wilson, H., 1998. Modelling and prediction of phyto- and zooplankton dynamics in Lake Kasumigaura by artificial neural networks. Lakes Reserv. Res. Manag. 3, 123–133.
Richardson, A., Silulwane, N., Mitchell-Innes, B., Shillington, F., 2003. A dynamic quantitative approach for predicting the shape of phytoplankton profiles in the ocean. Prog. Oceanogr. 59, 301–319.
Scardi, M., Harding, L.W., 1999. Developing an empirical model of phytoplankton primary production: a neural network case study. Ecol. Model. 120, 213–223.
Schapire, R.E., Freund, Y., 1998. Boosting the margin: a new explanation for the effectiveness of voting methods. Ann. Stat. 26, 322–330.
Smayda, T.J., 2002. Adaptive ecology, growth strategies and the global bloom expansion of dinoflagellates. J. Oceanogr. 58, 281–294.
Talbot, M., Bate, G., Campbell, E., 1990. A review of the ecology of surf-zone diatoms, with special reference to Anaulus australis. Oceanogr. Mar. Biol. Annu. Rev. 28, 155–175.
Thuiller, W., 2004. Patterns and uncertainties of species' range shifts under climate change. Glob. Chang. Biol. 10, 2020–2027.
Thuiller, W., Lafourcade, B., Engler, R., Araújo, M.B., 2009. BIOMOD — a platform for ensemble forecasting of species distributions. Ecography 32, 369–373.
Thuiller, W., Lavorel, S., Araújo, M.B., Sykes, M.T., Prentice, I.C., 2005. Climate change threats to plant diversity in Europe. Proc. Natl. Acad. Sci. U. S. A. 102, 8245–8250.
Ting, K.M., Witten, I.H., 1999. Issues in stacked generalization. J. Artif. Intell. Res. 10, 271–289.
Todorovski, L., Džeroski, S., 2000. Combining multiple models with meta decision trees. In: Principles of Data Mining and Knowledge Discovery, Lecture Notes in Computer Science, vol. 1910. Springer, Berlin/Heidelberg.
Utermöhl, H., 1958. Zur Vervollkommnung der quantitativen Phytoplankton-Methodik. Schweizerbart Science Publishers, Stuttgart, Germany.
Vapnik, V.N., 1995. The Nature of Statistical Learning Theory. Springer-Verlag, New York.
Vilas, L.G., Spyrakos, E., Palenzuela, J.M.T., Pazos, Y., 2014. Support vector machine-based method for predicting Pseudo-nitzschia spp. blooms in coastal waters (Galician Rías, NW Spain). Prog. Oceanogr. 124, 66–77.
Wilson, H., Recknagel, F., 2001. Towards a generic artificial neural network model for dynamic predictions of algal abundance in freshwater lakes. Ecol. Model. 146, 69–84.
Wolpert, D., 1992. Stacked generalization. Neural Netw. 5, 241–259.