
SCRS/2008/038

Collect. Vol. Sci. Pap. ICCAT, 64(7): 2443-2454 (2009)

MACHINE LEARNING PROCEDURES: AN APPLICATION TO BY-CATCH DATA OF THE MARINE TURTLES CARETTA CARETTA IN THE SOUTHWESTERN ATLANTIC OCEAN

Maite Pons 1,a, Soledad Marroni 2,3, Irene Machado 4, Badih Ghattas 5 & Andrés Domingo 1

SUMMARY

In the present study we evaluate the performance of different Machine Learning methods in predicting the unreported Caretta caretta by-catch of the Uruguayan longline fishery in the Southwestern Atlantic Ocean. The methods evaluated were Classification and Regression Trees, Random Forest, CForest and Support Vector Machines, and the model with the lowest predictive error rate was selected. We used on-board observer data and several explanatory variables to predict the loggerhead by-catch that went unreported in logbooks from 1998 to 2007. Random Forest and CForest were selected because they had the lowest predictive error rates. Random Forest predicted a total capture of 13 065 loggerhead turtles during the study period and CForest 12 892. We also evaluated variable importance in the prediction for both methods. Year, type of fishing gear and month are the most important variables in the by-catch of loggerhead sea turtles. Machine Learning methods appear to be useful where access to information is limited, particularly in fisheries where the total catch recorded in logbooks is under-reported or missing altogether.

KEYWORDS Machine learning, pelagic longline, Loggerhead turtles, CPUE, Southwestern Atlantic

1. Introduction

Five species of marine turtles occur in the southwestern Atlantic Ocean: Caretta caretta, Dermochelys coriacea, Chelonia mydas, Lepidochelys olivacea and Eretmochelys imbricata. All of these species are listed in Appendix I of the Convention on International Trade in Endangered Species of Wild Fauna and Flora (CITES) and are classified as endangered or critically endangered on the World Conservation Union Red List (IUCN 2008). One of the main causes of mortality of juvenile and adult marine turtles is incidental capture in fishing gear, especially pelagic longlines (Spotila et al. 2000, Yeung 2001, Lewison et al. 2004, Lewison and Crowder 2007). Because of their life history characteristics, namely late age at maturity and low reproductive rates (Heppell et al. 1999), these species are particularly vulnerable to the negative effects of by-catch.

Pelagic longliners operate in all oceans, targeting swordfish (Xiphias gladius), tunas (Thunnus spp.) and some shark species, and everywhere they capture other species as by-catch. In the Pacific Ocean, pelagic longline by-catch has been implicated as an important cause of regional declines in threatened sea turtle populations (Spotila et al. 2000). In the southwestern Atlantic Ocean, interactions between marine turtles and pelagic longlines have been widely reported (Achaval et al. 2000, Kotas et al. 2003, Domingo et al. 2006a, b, Sales et al. 2008, López-Mendilaharsu et al. 2007). The two species with the highest capture rates in the region are the loggerhead (C. caretta) and the leatherback (D. coriacea). This area registered the highest loggerhead CPUE values in the world's oceans (López-Mendilaharsu et al. 2007).

The Uruguayan longline fleet has been operating in the South Atlantic Ocean since 1981 (Rios et al. 1986). Captains record the catch of target species in onboard logbooks, but by-catch is not recorded.
With the implementation in 1998 of the national observer program, “Programa Nacional de Observadores a bordo de la Flota Atunera” (PNOFA), by the “Dirección Nacional de Recursos Acuáticos” (DINARA) (Mora & Domingo 2006), marine turtle by-catch data started to be recorded. Scientific observers collect data on local environmental conditions, details of fishing operations and catch by species (target, by-catch, discards and lost catch). By-catch and discards on the unobserved vessels of the fleet remain unknown.

The use of Machine Learning procedures in biology has focused mainly on bioinformatics, statistical genomics and genetic diseases (Culhane et al. 2002, Bhardwaj et al. 2005, Díaz-Uriarte and Alvarez de Andrés 2006, Strobl et al. 2007), with some studies in ecology (Recknagel 2001, Benito Garzón et al. 2006, Shan et al. 2006, Peters et al. 2007) based mainly on artificial neural networks (Chon et al. 1996, Lek et al. 1996). These procedures have rarely been applied to fishery data. Lennert-Cody & Berk (2007) implemented Random Forest to determine the unreported dolphin by-catch in purse-seine fisheries in the Pacific Ocean. Tserpes et al. (2006) analyzed data from the Greek swordfish fishing fleets in the eastern Mediterranean by means of machine-learning approaches in order to define differences in exploitation patterns and fishing strategies. Watters & Deriso (2000) used regression tree methods to analyze catch per unit of effort from the Japanese longline fishery for bigeye tuna in the central and eastern Pacific Ocean.

Given the critical status of marine turtle populations and the small amount of reliable data available, estimating the total catch of these species is of utmost importance. In the present study we evaluate the performance of different Machine Learning methods in predicting the unreported C. caretta by-catch of the Uruguayan longline fishery. We then select the best method and predict the total loggerhead by-catch of Uruguayan longliners in the Southwestern Atlantic Ocean.

2. Materials and methods

2.1 Observer data

Data were collected by PNOFA scientific observers between April 1998 and November 2007. The date, geographic position (latitude and longitude), effort (number of hooks), sea surface temperature (in degrees Celsius) and catch (number of individuals by species) were recorded for each set. A total of 2 856 721 hooks were observed in 1 448 sets between 18º and 41º S and 20º and 54º W (Figure 1). As the Uruguayan longline fleet uses two types of fishing gear, we considered two categories: 1) American monofilament longline of polyamide and 2) Spanish multifilament longline of nylon. For details about the fishing gear and operations of this fleet see Jiménez et al. (2009).

2.2 Logbook data

For the same period and area, we analyzed data from the unobserved fraction of the Uruguayan longline fleet. The predictor variables were obtained from logbooks provided by DINARA. Sets without temperature and geographical position data were excluded from the analysis. A total of 6 637 sets were analyzed, with an effort of 6 652 481 hooks.

2.3 Coverage of the observer program

The spatial coverage of PNOFA is well distributed over the total operating area of the longline fleet (Figure 1). Annual PNOFA coverage, relative to the total effort analyzed for the longline fleet, varied between 3% and 71%, with a total coverage of 30% over the study period. The smallest coverage was registered in 2000 and the largest in 2007. The minimum effort was registered in 2000 and the maximum in 2005 (Table 1).

2.4 Data analysis

We applied Classification and Regression Trees (Breiman et al. 1984), Random Forests (Breiman 2001), CForest (Hothorn et al. 2006) and Support Vector Machines (Vapnik 1995) to predict the total loggerhead by-catch.
A total of seven variables were used to predict the total catch of loggerhead turtles (Table 2). These variables include environmental and fishing gear characteristics. The number of turtles caught was the response and, since it is treated as continuous, we used regression procedures. We also calculated the Capture Per Unit of Effort (CPUE) as the number of C. caretta per 1 000 hooks (cc./1 000 hooks). All marine turtles captured by the Uruguayan longline fleet are discarded, and in general they are released alive. We emphasize, however, that this paper predicts only the total loggerhead by-catch; mortality is not evaluated.

Classification and Regression Trees (CART) (Breiman et al. 1984) is a binary splitting method that recursively partitions the data set into disjoint subgroups. It uses two algorithms: the first grows a maximal tree and the second prunes it to obtain the best subtree. For the maximal tree, observations in the learning data set are split iteratively into two sub-samples according to a binary rule such as “temperature < 24°”. The splitting rule is based on one explanatory variable and a threshold on that variable, chosen to minimize the heterogeneity of the resulting sub-samples. When the output variable is discrete, classification trees are built using the “entropy” or “Gini” criterion; for a continuous outcome, regression trees are built using the “deviance” criterion. The two sub-samples are then partitioned recursively in the same way until too few observations (usually five) remain in the resulting samples (other stopping rules are available). The resulting tree may have too many terminal nodes, which is why it is pruned. For regression trees, each leaf is assigned the mean value of the output variable over the observations within the corresponding region.
For classification trees, the assigned class is the most frequent one within each leaf. These models have been widely studied in Machine Learning and applied statistics and offer many advantages, such as their representation as a binary tree, their ability to work in high dimension and their ranking of variables. Many extensions of regression and classification trees are available; see for instance Nerini & Ghattas (2007).
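The split search described for regression trees can be illustrated with a minimal Python sketch. It is not the implementation in the R package tree used in the study: the function names and the toy temperature and catch values are hypothetical, and the "deviance" is taken here to be the within-group sum of squared deviations from the mean.

```python
from statistics import mean


def deviance(values):
    """Sum of squared deviations from the group mean (0 for an empty group)."""
    if not values:
        return 0.0
    m = mean(values)
    return sum((v - m) ** 2 for v in values)


def best_split(x, y):
    """Return (threshold, deviance) for the best binary rule 'x < threshold'."""
    best = (None, deviance(y))  # start from the unsplit node
    levels = sorted(set(x))
    # Candidate thresholds are midpoints between consecutive distinct values.
    for lo, hi in zip(levels, levels[1:]):
        t = (lo + hi) / 2
        left = [yi for xi, yi in zip(x, y) if xi < t]
        right = [yi for xi, yi in zip(x, y) if xi >= t]
        d = deviance(left) + deviance(right)
        if d < best[1]:
            best = (t, d)
    return best


# Hypothetical toy data: sea surface temperature (°C) and turtles caught per set.
temperature = [18.0, 19.5, 21.0, 22.5, 25.0, 26.0]
turtles = [3, 4, 3, 4, 0, 1]
threshold, dev = best_split(temperature, turtles)
print(threshold, dev)
```

A full tree would apply the same search to every explanatory variable and then recurse on the two resulting sub-samples until a stopping rule is met.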

Random Forests (RF) (Breiman 2001) is an ensemble method that aggregates K trees (a forest) similar to those built with CART, each grown on a bootstrap sample of the original data set. At each node, each tree in the forest uses only a subset of the explanatory variables (in our case 2, as suggested by Liaw & Wiener (2002)). For regression trees, the RF prediction is the mean of the predictions of the K trees; for classification trees, it is the majority vote. RF can greatly increase prediction accuracy compared with an individual tree. The ensemble compensates for the instability of individual trees, in which small changes in the learning sample impair prediction accuracy in test samples (Strobl et al. 2002).

CForest (CF) is also an ensemble method like RF, but its trees differ from CART and follow the conditional inference framework (Hothorn et al. 2006). These trees choose the splitting rule using statistical tests, in contrast with CART, which uses heterogeneity measures of the response variable. Conditional inference trees select variables in an unbiased way, and the partitions induced by this recursive partitioning algorithm are not affected by overfitting (Hothorn et al. 2006).

Support Vector Machines (SVM) (Vapnik 1995) look for a linear separator between the classes that gives a perfect classification and maximizes the “margins”, that is, stays as far as possible from the class borders. When the data set is not linearly separable, it is mapped by a nonlinear function into a higher-dimensional space (the feature space) where a linear separation should be possible. The nonlinear mapping does not need to be known explicitly, because the kernel function (the scalar product between mapped observations) is sufficient to construct the separating function (Hastie et al. 2001).
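The bootstrap-and-average idea behind Random Forests for regression can be sketched in a few lines of Python. This is an illustration under simplifying assumptions, not the randomForest package: each "tree" is a single-split stump, so the random subset of variables is drawn once per tree rather than at every node, and all names and toy values are hypothetical.

```python
import random
from statistics import mean


def sse(values):
    """Sum of squared deviations from the mean (0 for an empty group)."""
    if not values:
        return 0.0
    m = mean(values)
    return sum((v - m) ** 2 for v in values)


def fit_stump(rows, targets, features):
    """Fit a one-split regression tree using only the given feature indices."""
    best = None
    for j in features:
        levels = sorted({r[j] for r in rows})
        for lo, hi in zip(levels, levels[1:]):
            t = (lo + hi) / 2
            left = [y for r, y in zip(rows, targets) if r[j] < t]
            right = [y for r, y in zip(rows, targets) if r[j] >= t]
            score = sse(left) + sse(right)
            if best is None or score < best[0]:
                best = (score, j, t, mean(left), mean(right))
    if best is None:  # no feature varies in this sample: predict the mean
        m = mean(targets)
        return lambda row: m
    _, j, t, left_mean, right_mean = best
    return lambda row: left_mean if row[j] < t else right_mean


def fit_forest(rows, targets, n_trees=50, mtry=2, seed=0):
    """Grow n_trees stumps, each on a bootstrap sample and mtry random features."""
    rng = random.Random(seed)
    n, p = len(rows), len(rows[0])
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        feats = rng.sample(range(p), min(mtry, p))  # random variable subset
        trees.append(fit_stump([rows[i] for i in idx],
                               [targets[i] for i in idx], feats))
    return trees


def predict_forest(trees, row):
    """Regression forests predict the mean of the individual tree predictions."""
    return mean(tree(row) for tree in trees)


# Hypothetical toy sets: (temperature in °C, month, gear code) and turtles caught.
rows = [(18.0, 1, 0), (19.5, 2, 1), (21.0, 3, 0),
        (22.5, 4, 1), (25.0, 5, 0), (26.0, 6, 1)]
targets = [3, 4, 3, 4, 0, 1]
forest = fit_forest(rows, targets, n_trees=50, mtry=2, seed=0)
estimate = predict_forest(forest, (19.0, 2, 1))
```

Averaging over many trees grown on resampled data is what dampens the instability of any single tree.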
2.5 Model selection

To determine the method that best predicts the total catch of loggerheads by Uruguayan longliners, the original data were randomly divided into two samples. The first, 2/3 of the original data, is the training set or learning sample and is used to train the models. The second, the remaining 1/3, is the test sample or evaluation set and is used to validate the models. Validation consists of calculating the Mean Square Error (MSE), which is appropriate when a continuous response such as abundance is modeled. This procedure (subdivision into training and test samples) was repeated 100 times. Finally, we averaged the MSE of each model over the 100 test samples. The model with the lowest predictive error was selected to predict the unreported loggerhead by-catch.

2.6 Variables importance

RF and CF are not as easy to interpret as an individual classification tree generated by CART, where the influence of a predictor variable corresponds directly to its position in the tree. Alternative measures of variable importance are therefore required to interpret RF and CF (Strobl et al. 2002). The importance of a variable in RF and CF is derived from its contribution over all the nodes and all the trees where it is used (Breiman 2002). Strobl et al. (2007) suggest that, compared with RF, the CF function provides unbiased variable selection in the individual classification trees, even when the potential predictor variables vary in their scale of measurement or their number of categories. Variable importance was measured in RF as the percentage increase in MSE and in CF as the mean decrease in accuracy.

All analyses were carried out in the free software R (Ihaka and Gentleman 1996, R Development Core Team 2007) using the packages tree (Ripley 2007) for CART, randomForest (Liaw & Wiener 2007) for RF, party (Hothorn et al. 2007) for CF and e1071 (Dimitriadou et al. 2007) for SVM.

3. Results

3.1 Observed data

Loggerhead turtles were caught in 755 sets, so 48% of the observed sets had zero catch. A total of 2 261 loggerhead turtles were caught in the fraction of the fleet observed by PNOFA during 1998-2007 (Table 2), with a mean annual CPUE of 1.0 cc./1 000 hooks. The highest CPUE was observed in 2007 (2.0 cc./1 000 hooks) and the lowest in 2003 (0.4 cc./1 000 hooks).

3.2 Machine learning procedures

The average MSE obtained for each model in the selection process was 0.021 for CART, 0.017 for RF, 0.018 for CF and 0.33 for SVM. RF and CF performed similarly, although RF had the lowest predictive error. SVM had the highest predictive error. We therefore chose both RF and CF to predict the unreported incidental capture of loggerhead sea turtles.

RF predicted a total capture of 10 804 (95% CI = 8 758-12 850) turtles for the unobserved fraction of the fleet which, added to the 2 261 reported by PNOFA observers, gives a total of 13 065 loggerhead turtles for the period between April 1998 and November 2007 (Table 1). RF explained 46% of the variability of the data. CF predicted a total capture of 10 631 (95% CI = 3 778-17 484) turtles which, added to those reported by PNOFA, gives a total of 12 892 loggerheads.

As RF had the lowest mean predictive error and the narrower confidence interval, we used this model to calculate the predicted loggerhead CPUE. This allows the CPUE observed by PNOFA to be compared with the total loggerhead CPUE. The results are shown in Figure 2. The curves show a similar trend, except for the periods 1999-2000 and 2004-2007, when the trends of the observed and the total predicted CPUE are reversed. Except for 1998, all total CPUE values were greater than those observed by PNOFA (Figure 2). The biggest difference occurred in 2006, with an observed CPUE of 0.6 cc./1 000 hooks and a corresponding total estimated value of 2.1 cc./1 000 hooks.

3.3 Variables importance

The percentage increase in MSE in RF and the mean decrease in accuracy in CF show that some predictors were more important than others, but their order differed between the two methods.
In decreasing order of importance, the variables used by RF were year, month, latitude, temperature, longitude, effort and gear; for CF they were year, gear, month, temperature, latitude, effort and longitude (Figure 3).

4. Discussion

4.1 Prediction

Consistent data are required for the effective management of fisheries, and observer programs are the most reliable source of information. In this paper we evaluated the performance of different machine learning methods in estimating the unreported C. caretta by-catch in longline fisheries. Two of these methods, RF and CF, were the most accurate, with little difference in prediction error. Both methods are therefore useful for modeling our data set, with RF performing best. RF is a powerful statistical tool that has found many applications in various scientific areas (Strobl et al. 2002). It has been applied to a wide variety of problems, such as studies of complex genetic diseases and the prediction of phenotypes from amino acid or DNA sequences (Strobl et al. 2002, Cho and Won 2003), and in ecological studies predicting forest areas (Benito Garzón et al. 2006), the occurrence of different types of vegetation (Peters et al. 2007) or the spatial distribution of an endangered Australian marsupial (Shan et al. 2006), among others. Díaz-Uriarte and Alvarez de Andrés (2006) and Benito Garzón et al. (2006) also found that RF performed better than other methods (SVM, K Nearest Neighbors) for real and simulated DNA data.

4.2 Variables importance

In general, importance measures are useful for selecting predictors when the number of variables is large, and they can guide the decision on which and how many predictor variables to select for a given problem. Here we used this approach to understand how each variable influences the catch of loggerhead sea turtles.
The two methods rank the variables in different orders. Both the RF and CF importance measures indicate that year is the most important variable, followed by month and latitude in RF and by gear and month in CF. However, gear was the least important variable in RF (Figure 3). Strobl et al. (2007) suggested using the CF procedure to evaluate variable importance when the potential predictors vary in their number of categories or scale level, as in our case. The gear variable has only two categories, which may be why it appeared less important in RF. This method is based on the Gini split criterion, which is known to favor variables with more categories in variable selection (Breiman et al. 1984, Strobl et al. 2006).
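The idea of measuring importance as an increase in MSE can be illustrated with a simplified permutation sketch in Python. It is an assumption-laden illustration rather than the randomForest or party computation: the values of one variable are shuffled on the full toy data set (not on out-of-bag samples), the result is a raw rather than percentage increase, and the fitted model, names and values are all hypothetical.

```python
import random
from statistics import mean


def mse(predict, rows, targets):
    """Mean squared error of a prediction function on a data set."""
    return mean((predict(r) - y) ** 2 for r, y in zip(rows, targets))


def permutation_importance(predict, rows, targets, n_repeats=20, seed=0):
    """Average increase in MSE after shuffling each variable in turn."""
    rng = random.Random(seed)
    baseline = mse(predict, rows, targets)
    increases = []
    for j in range(len(rows[0])):
        permuted_errors = []
        for _ in range(n_repeats):
            column = [r[j] for r in rows]
            rng.shuffle(column)  # break the link between variable j and the catch
            shuffled = [r[:j] + (v,) + r[j + 1:] for r, v in zip(rows, column)]
            permuted_errors.append(mse(predict, shuffled, targets))
        increases.append(mean(permuted_errors) - baseline)
    return increases


# Hypothetical toy sets: (temperature in °C, month) and turtles caught.
rows = [(18.0, 1), (19.5, 2), (21.0, 3), (22.5, 4), (25.0, 5), (26.0, 6)]
targets = [3, 4, 3, 4, 0, 1]


def model(row):
    """Hypothetical fitted model that uses temperature only."""
    return 3.5 if row[0] < 23.75 else 0.5


importance = permutation_importance(model, rows, targets)
print(importance)
```

A variable the model never uses (month in this toy case) shows no increase in error when shuffled, whereas shuffling a variable the model relies on degrades the predictions.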

4.3 Final remarks

Because the catch of loggerhead turtles varies widely with fishing zone, season, type of fleet and even environmental factors, estimating the total catch of this species by direct extrapolation is not viable. Yeung (2001) took this variability into account and used Generalized Linear Models (Delta-Lognormal models) to estimate the loggerhead by-catch of the US longline fishery in 2000. Lewison et al. (2004), however, estimated the number of incidental loggerhead and leatherback catches in the world's oceans assuming that CPUE was homogeneous.

PNOFA coverage varied widely between years, with low coverage in 2000 and 2001 and high coverage in 2007. Babcock et al. (2003) suggest, from simulation studies, that coverage levels of at least 20 percent for common species and 50 percent for rare species would give reasonably good estimates of total by-catch. The level of coverage required for a particular fishery, however, depends on the size of the fishery, the distribution of catch and by-catch, and the spatial stratification of the fishery. The estimates of total loggerhead by-catch for years with low observer effort must therefore be taken with caution.

The quality of reported data is important for assessing population status. Machine Learning procedures and a well-rounded observer program can help produce reliable data. For example, Lennert-Cody and Berk (2007) implemented RF to determine the unreported dolphin by-catch in purse-seine fisheries in the Pacific Ocean. Machine Learning procedures seem to be useful where access to information is limited, for example in fisheries where the total catch recorded in logbooks is under-reported or missing altogether.
Applying a prediction model of this kind to estimate the unreported catch is very important for fisheries management and species conservation. Our study estimated that the total capture of loggerheads is about six times greater than that reported by observers, which is highly relevant for a species listed as vulnerable on the IUCN Red List (IUCN, 2007). This value is also an underestimate, since some sets made by the Uruguayan fleet during this period were not included in the study. In most fisheries management, stock assessments are based on logbook data. We therefore suggest that Machine Learning approaches can be used to correct logbook information against more reliable sources, such as data collected by on-board observers, not only for by-catch but also for target species.

Acknowledgements

To the scientific observers of PNOFA, and to the captains, crews and owners of the vessels. To Mathias Bourel, Caren Barceló and Stella Weng for the translation and comments.

References

Acha, E.M., Mianzan, H.W., Guerrero, R.A., Favero, M., Bava, J., 2004. Marine fronts at the continental shelves of austral South America: physical and ecological processes. J. Marine Syst. 44, 83-105.
Achaval, F., Marin, Y.H., Barea, L.C., 2000. Captura incidental de tortugas marinas con palangre pelágico oceánico en el Atlántico Sud-occidental. En: G. Arena & M. Rey (Eds.). Captura de grandes peces pelágicos (pez espada y atunes) en el Atlántico Sud-occidental, y su interacción con otras poblaciones. INAPE - PNUD URU/92/003. Pp. 83-88. Montevideo, Uruguay.
Babcock, E.A., Pikitch, E.K., Hudson, C.G., 2003. How much observer coverage is enough to adequately estimate by-catch? Washington, DC: Oceana.
Benito Garzón, M., Blazek, R., Neteler, M., Sánchez de Dios, R., Sainz Ollero, H., Furlanello, C., 2006. Predicting habitat suitability with machine learning models: The potential area of Pinus sylvestris L. in the Iberian Peninsula. Ecol. Model. 197, 383-393.

Bhardwaj, N., Langlois, R.E., Zhao, G., Lu, H., 2005. Kernel-based machine learning protocol for predicting DNA-binding proteins. Nucleic Acids Res. 33, 6486-6493.
Breiman, L., Friedman, J.H., Olshen, R.A., Stone, C.J., 1984. Classification and Regression Trees. New York: Chapman and Hall.
Breiman, L., 2001. Random Forests. Mach. Learn. 45, 5-32.
Breiman, L., 2002. Manual on Setting Up, using, and understanding Random Forests v3.1. [http://www.stat.berkeley.edu/users/breiman/RandomForests/cc.home.htm].

Cho, S.B., Won, H.H., 2003. Machine Learning in DNA Microarray Analysis for Cancer Classification. Conferences in Research and Practice in Information Technology Series, 33.
Chon, T.-S., Park, Y.S., Moon, K.H., Cha, E.Y., 1996. Patternizing communities by using artificial neural network. Ecol. Model. 90, 69-78.
Culhane, A.C., Perriere, G., Considine, E.C., Cotter, T.G., Higgins, D.G., 2002. Between-group analysis of microarray data. Bioinformatics 18, 1600-1608.
Díaz-Uriarte, R., Alvarez de Andrés, S., 2006. Gene selection and classification of microarray data using random forest. BMC Bioinformatics 7, 3.
Dimitriadou, E., Hornik, K., Leisch, F., Meyer, D., Weingessel, A., 2007. e1071: Misc Functions of the Department of Statistics (e1071), TU Wien. [http://cran.r-project.org/src/contrib/Descriptions/e1071.html] [R package version 1.5-17].
Domingo, A., Bugoni, L., Prosdocimi, L., Miller, P., Laporta, M., Monteiro, D.S., Estrades, A., Albareda, D., 2006a. El impacto generado por las pesquerías en las tortugas marinas en el Océano Atlántico sud occidental. WWF Programa Marino para Latinoamérica y el Caribe, San José, Costa Rica.
Domingo, A., Sales, G., Giffoni, B., Miller, P., Laporta, M., Maurutto, G., 2006b. Distribución y composición de tallas de las tortugas marinas (Caretta caretta y Dermochelys coriacea) que interactúan con el palangre pelágico en el Atlántico Sur. Collect. Vol. Sci. Pap. ICCAT, 59: 992-1002.
Garrison, L.P., 2005. Estimated By-catch of Marine Mammals and Turtles in the U.S. Atlantic Pelagic Longline Fleet during 2004. NOAA Technical Memorandum NMFS-SEFSC-531, 57 p.
Hastie, T.J., Tibshirani, R., Friedman, J., 2001. The Elements of Statistical Learning. Data Mining, Inference, and Prediction. Springer.
Heppell, S.S., Crowder, L.B., Menzel, T.R., 1999. Life table analysis of long-lived marine species with implications for conservation and management. In: Musick, J.A. (Ed.). Life in the Slow Lane: Ecology and Conservation of Long-Lived Marine Animals. American Fisheries Society Symposium, 23. Bethesda, MD.
Ihaka, R., Gentleman, R., 1996. R: a language for data analysis and graphics. J. Comput. Graph. Stat. 5, 299-314.
Hothorn, T., Hornik, K., Zeileis, A., 2006. Unbiased Recursive Partitioning: A Conditional Inference Framework. J. Comput. Graph. Stat. 15, 651-674.
Hothorn, T., Hornik, K., Zeileis, A., 2007. party: A Laboratory for Recursive Part(y)itioning. [http://cran.r-project.org/src/contrib/Descriptions/party.html] [R package version 0.9-93].
IUCN, 2008. IUCN 2008 List of Threatened Species. A global species assessment. [http://www.redlist.org]
Jiménez, S., Domingo, A., Brazeiro, A., 2009. Seabird by-catch in the Southwest Atlantic: interaction with the Uruguayan pelagic longline fishery. Polar Biol. 32, 187-196.

Kotas, J.E., Dos Santos, S., De Azevedo, V.G., Gallo, B.M.G., Barata, P.C.R., 2003. Incidental capture of loggerhead (Caretta caretta) and leatherback (Dermochelys coriacea) sea turtles by the pelagic longline fishery off southern Brazil. Fish. Bull. 102, 393-399.
Lek, S., Delacoste, M., Baran, P., Dimonopoulos, I., Lauga, J., Aulagnier, J., 1996. Application of neural networks to modelling nonlinear relationships in ecology. Ecol. Model. 90, 39-52.
Lennert-Cody, C.E., Berk, R.A., 2007. Statistical learning procedures for monitoring regulatory compliance: an application to fisheries data. J. R. Statist. Soc. A 170, 671-689.
Liaw, A., Wiener, M., 2007. randomForest: Breiman and Cutler's random forests for classification and regression. [http://cran.r-project.org/src/contrib/Descriptions/randomForest.html] [R package version 4.522].
Liaw, A., Wiener, M., 2002. Classification and regression by random forest. R News 2/3, 18-22.
Lewison, R.L., Freeman, S.A., Crowder, L.B., 2004. Quantifying the effects of fisheries on threatened species: the impact of pelagic longlines on loggerhead and leatherback sea turtles. Ecol. Lett. 7, 221-231.
Lewison, R.L., Crowder, L.B., 2007. Putting Longline By-catch of Sea Turtles into Perspective. Conserv. Biol. 21, 79-86.
López-Mendilaharsu, M., Sales, G., Giffoni, B., Miller, P., Niemeyer Fiedler, F., Domingo, A., 2007. Distribución y composición de tallas de las tortugas marinas (Caretta caretta y Dermochelys coriacea) que interactúan con el palangre pelágico en el Atlántico Sur. Collect. Vol. Sci. Pap. ICCAT, 60: 2094-2109.
Mora, O., Domingo, A., 2006. Informe sobre el Programa de Observadores a bordo de la flota atunera uruguaya (1998-2004). Collect. Vol. Sci. Pap. ICCAT, 59(2): 559-607.
Nerini, D., Ghattas, B., 2007. Classifying densities using functional regression trees: Applications in oceanology. Comput. Stat. Data An. 51, 4984-4993.
Peters, J., De Baets, B., Verhoest, N.E.C., Samson, R., Degroeve, S., De Becker, P., Huybrechts, W., 2007. Random forests as a tool for ecohydrological distribution modeling. Ecol. Model. 207, 304-318.
R Development Core Team, 2007. R: A Language and Environment for Statistical Computing. [http://CRAN.R-project.org/]. R Foundation for Statistical Computing, Vienna, Austria.
Recknagel, F., 2001. Applications of machine learning to ecological modelling. Ecol. Model. 146, 303-310.
Rios, C., Leta, R., Mora, O., Rodríguez, J., 1986. La pesca de atunes y especies afines por parte de la flota de altura palangrera uruguaya. Ier. Simp. Cient. CTMFM, Mar del Plata, Argentina 1984, 1, 483-544.
Ripley, B., 2007. tree: Classification and regression trees. [http://cran.r-project.org/src/contrib/Descriptions/tree.html] [R package version 1.0-26].

Sales, G., Giffoni, B., Barata, P., 2008. Incidental catch of sea turtles by the Brazilian pelagic longline fishery. J. Mar. Biol. Assoc. UK. 88, 853–864.
Shan, Y., Paull, D., McKay, R.I., 2006. Machine learning of poorly predictable ecological data. Ecol. Model. 195, 129–138.
Spotila, J.R., Reina, R.R., Steyermark, A.C., Plotkin, P.T., Paladino, F.V., 2000. Pacific leatherback turtles face extinction. Nature 405, 529-530.
Strobl, C., Boulesteix, A.L., Augustin, T., 2006. Unbiased split selection for classification trees based on the Gini Index. Comput. Stat. Data An. 52, 483-501.

Strobl, C., Boulesteix, A.L., Zeileis, A., Hothorn, T., 2007. Bias in random variable importance measures: Illustrations, sources and a solution. BMC Bioinformatics 8, 25.
Tserpes, G., Moutopoulos, D.K., Peristeraki, P., Katselis, G., Koutsikopoulos, C., 2006. Study of swordfish fishing dynamics in the eastern Mediterranean by means of machine-learning approaches. Fish. Res. 78, 196-202.
Vapnik, V., 1995. The Nature of Statistical Learning Theory. Springer-Verlag, New York.
Watters, G., Deriso, R., 200. Catch per unit of effort of bigeye tuna: a new analysis with regression trees and simulated annealing. Bull. Inter-Amer. Trop. Tuna Comm. 21(8): 527-571.
Yeung, C., 2001. Estimates of marine mammal and marine turtle by-catch by the U.S. Atlantic pelagic longline fleet in 1999-2000. NOAA Technical Memorandum NMFS-SEFSC-467, 43 p.


Table 1. Percentage of the total effort of fleet observed by PNOFA and estimation of the number of turtles captured by the Uruguayan longline fleet in the southwestern Atlantic Ocean during April 1998 through November 2007.

Year     Effort   % obs.   Observed   Est. RF   Est. CF   Total RF   Total CF
1998        544     10.6         82       506       521        588        603
1999        553     12.0         56       722       652        778        708
2000        386      3.1          7       539       653        546        660
2001        520      6.4         66     1 116     1 013      1 182      1 079
2002        564     11.6         95       950       784      1 045        879
2003      1 274     33.0        163     1 035       886      1 198      1 049
2004      1 747     45.2        338     1 206     1 076      1 544      1 414
2005      1 934     25.9        282     1 637     1 409      1 919      1 691
2006      1 345     33.9        261     2 578     3 080      2 839      3 341
2007        643     70.8        911       515       557      1 426      1 468
Total     9 509     30.0      2 261    10 804    10 631     13 065     12 892

Effort: total effort (in thousands of hooks). % obs.: % observed by PNOFA. Observed: No. of turtles observed by PNOFA. Est. RF, Est. CF: No. of turtles estimated by RF and by CF. Total RF, Total CF: total No. of turtles captured estimated by RF and by CF.
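The total capture figures in Table 1 appear to be the PNOFA-observed counts plus the counts estimated by each method; this reading is inferred from the tabulated values, not stated in the caption. A minimal Python sketch of that arithmetic follows, with the values transcribed from the table and a purely illustrative helper name (`total_capture`):

```python
# Values transcribed from Table 1, one entry per year, 1998-2007.
YEARS = list(range(1998, 2008))
OBSERVED = [82, 56, 7, 66, 95, 163, 338, 282, 261, 911]
ESTIMATED_RF = [506, 722, 539, 1116, 950, 1035, 1206, 1637, 2578, 515]
ESTIMATED_CF = [521, 652, 653, 1013, 784, 886, 1076, 1409, 3080, 557]


def total_capture(observed, estimated):
    """Add observed and estimated counts year by year (assumed relationship)."""
    return [obs + est for obs, est in zip(observed, estimated)]


total_rf = total_capture(OBSERVED, ESTIMATED_RF)
total_cf = total_capture(OBSERVED, ESTIMATED_CF)

for year, rf, cf in zip(YEARS, total_rf, total_cf):
    print(year, rf, cf)
print("Total", sum(total_rf), sum(total_cf))  # Total 13065 12892
```

The yearly sums reproduce the tabulated totals, including the 13 065 (RF) and 12 892 (CF) turtles reported for the whole study period.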


Table 2. Variables used in the analysis of the prediction of loggerhead by-catch.

Variable                   Type               Observations
Year                       Categorical (10)   Period: 1998-2007
Month                      Categorical (12)   January-December
Sea Surface Temperature    Continuous         In ºC (range: 9-30º)
Latitude                   Continuous         In decimal scale
Longitude                  Continuous         In decimal scale
Fishing gear               Categorical (2)    1: American monofilament; 2: Spanish multifilament
Effort                     Continuous         In number of hooks (range: 120-3 000 hooks)
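As an illustration of how the seven explanatory variables of Table 2 might be held in code, the sketch below defines one record per fishing set. The class name, field names and example values are hypothetical, and treating the ranges reported in the table as hard validation limits is an assumption of this sketch, not something the study states.

```python
from dataclasses import dataclass

# Gear codes as listed in Table 2.
GEAR_TYPES = {1: "American monofilament", 2: "Spanish multifilament"}


@dataclass(frozen=True)
class LonglineSet:
    """One fishing set described by the explanatory variables of Table 2."""
    year: int            # categorical, 1998-2007
    month: int           # categorical, 1-12
    sst_celsius: float   # sea surface temperature in degrees Celsius
    latitude: float      # decimal scale
    longitude: float     # decimal scale
    gear: int            # key of GEAR_TYPES
    hooks: int           # effort, in number of hooks

    def __post_init__(self):
        # Assumption: the ranges reported in Table 2 are used as hard limits.
        if not 1998 <= self.year <= 2007:
            raise ValueError(f"year outside study period: {self.year}")
        if not 1 <= self.month <= 12:
            raise ValueError(f"invalid month: {self.month}")
        if not 9 <= self.sst_celsius <= 30:
            raise ValueError(f"SST outside reported range: {self.sst_celsius}")
        if self.gear not in GEAR_TYPES:
            raise ValueError(f"unknown gear code: {self.gear}")
        if not 120 <= self.hooks <= 3000:
            raise ValueError(f"effort outside reported range: {self.hooks}")


# Hypothetical example record, not taken from the data set.
example = LonglineSet(year=2005, month=7, sst_celsius=18.5,
                      latitude=-35.2, longitude=-52.1, gear=1, hooks=1200)
print(GEAR_TYPES[example.gear])
```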

[Figure 1: map. Labels: Brazil, Uruguay, Southwest Atlantic.]

Figure 1. Accumulated effort in 1º x 1º grid cells observed by PNOFA and total effort deployed by the Uruguayan longline fleet between April 1998 and November 2007 in the southwestern Atlantic Ocean. The black circles represent the total effort deployed by the fleet and the grey circles the effort observed by PNOFA, on the same scale.


[Figure 2: plot. x-axis: Year (1997-2007); y-axis: CPUE (cc./1000 hooks), 0 to 3.5; series: observed, predicted, total.]

Figure 2. Annual loggerhead CPUE (cc./1 000 hooks) observed by PNOFA, predicted by RF and total CPUE estimated.
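The three CPUE series can be approximated from the values in Table 1. The sketch below assumes that CPUE is the number of turtles divided by effort in thousands of hooks, that observed effort is total effort times the percentage observed by PNOFA, and that the predicted series refers to the unobserved remainder. These are assumptions about how the figure was built, and the function names are illustrative.

```python
# Values transcribed from Table 1, one entry per year, 1998-2007.
YEARS = list(range(1998, 2008))
EFFORT_KHOOKS = [544, 553, 386, 520, 564, 1274, 1747, 1934, 1345, 643]
PCT_OBSERVED = [10.6, 12.0, 3.1, 6.4, 11.6, 33.0, 45.2, 25.9, 33.9, 70.8]
OBSERVED = [82, 56, 7, 66, 95, 163, 338, 282, 261, 911]
ESTIMATED_RF = [506, 722, 539, 1116, 950, 1035, 1206, 1637, 2578, 515]


def cpue(turtles, thousand_hooks):
    """Catch per unit effort: turtles per 1 000 hooks."""
    return turtles / thousand_hooks


def annual_cpue(effort, pct_observed, observed, estimated):
    """Observed, predicted and total CPUE for one year (assumed definitions)."""
    observed_effort = effort * pct_observed / 100.0
    unobserved_effort = effort - observed_effort
    return {
        "observed": cpue(observed, observed_effort),
        "predicted": cpue(estimated, unobserved_effort),
        "total": cpue(observed + estimated, effort),
    }


cpue_by_year = {
    year: annual_cpue(effort, pct, obs, est)
    for year, effort, pct, obs, est in zip(
        YEARS, EFFORT_KHOOKS, PCT_OBSERVED, OBSERVED, ESTIMATED_RF
    )
}

for year, values in cpue_by_year.items():
    print(year, {name: round(value, 2) for name, value in values.items()})
```

Under these assumptions every yearly value stays below 3.5 turtles per 1 000 hooks, which matches the range of the vertical axis in Figure 2.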

[Figure 3 panels. a) RF: variables listed as year, month, latitude, temp, longitude, effort, gear; x-axis: %IncMSE (ticks 15 to 35). b) CF: variables listed as year, gear, month, temp, latitude, effort, longitude; x-axis: MDecA (ticks 1 to 5).]

Figure 3. Variable importance plots generated by RF (a) and by CF (b). The ranked variable importance is measured by the percentage increase in mean square error (%IncMSE) in RF and by the mean decrease in accuracy (MDecA) in CF. Temp is used as an abbreviation of temperature.
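Importance measures of this kind are commonly computed by permutation: the values of one variable are shuffled and the resulting loss of predictive accuracy is recorded. The Python sketch below illustrates the idea only. It does not reproduce the out-of-bag, per-tree computation or the scaling used by the R packages, the predictor is a hypothetical stand-in for a fitted forest, and the data are toy values, not the by-catch data.

```python
import random


def mse(y_true, y_pred):
    """Mean squared error between two equal-length sequences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def permutation_importance(predict, rows, y, column, n_repeats=20, seed=0):
    """Average increase in MSE after shuffling `column` across rows."""
    rng = random.Random(seed)
    baseline = mse(y, [predict(row) for row in rows])
    increases = []
    for _ in range(n_repeats):
        values = [row[column] for row in rows]
        rng.shuffle(values)
        # Rebuild each row with the shuffled value for this one column.
        permuted = [{**row, column: value} for row, value in zip(rows, values)]
        increases.append(mse(y, [predict(row) for row in permuted]) - baseline)
    return sum(increases) / n_repeats


def toy_predict(row):
    """Hypothetical stand-in for a fitted model: catch depends on gear only."""
    return 2.0 if row["gear"] == 1 else 0.5


# Toy data, not taken from the paper.
rows = [{"gear": 1 + i % 2, "longitude": -50.0 - i} for i in range(8)]
y = [toy_predict(row) for row in rows]

importance = {
    name: permutation_importance(toy_predict, rows, y, name)
    for name in ("gear", "longitude")
}
print(importance)
```

Because the toy predictor ignores longitude, shuffling that column leaves the error unchanged and its importance is zero, while shuffling gear increases the error.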
