Predicting the spatial distribution of seabed hardness based on presence/absence data using random forest
Jin Li*, Justy Siwabessy, Maggie Tran, Zhi Huang & Andrew D. Heap
National Earth and Marine Observations Group Environmental Geoscience Division *
[email protected]
Introduction Methods Results & Discussion Summary Acknowledgements
Seabed hardness is an important environmental property for predicting marine biodiversity that can be used to support marine zone management in Australia.
• Seabed hardness is one of the most important factors controlling the spatial distribution of benthic marine communities:
  • It influences the colonisation and formation of ecological communities and the abundance of benthic organisms.
  • It may also influence the nature of attachment of an organism to the seabed.
  • Hard substrates provide environments that generally support sessile suspension feeders, and soft (unconsolidated) substrates generally support discretely motile invertebrates.
Australian Statistical Conference 2014, Sydney
• Despite its importance, seabed hardness data are difficult to acquire.
• Traditional measurement of seabed hardness with a cone penetrometer provides data only at point locations.
• Seabed hardness can also be inferred from underwater video footage, but such footage is expensive and available only at a limited number of sampled locations.
• Seabed hardness is often inferred from multibeam backscatter data. Inferring it over large areas has become possible with the recent development and widespread use of multibeam sonar systems for seabed mapping.
• In this study, we explore whether seabed hardness can be predicted from video footage combined with multibeam backscatter data and its derived variables.
• To generate spatially continuous data of seabed hardness from point samples, spatial prediction methods are essential.
• Random forest (RF) is one of the top-performing methods in predictive modelling.
• Random forest is an ensemble method that combines many individual regression or classification trees as follows: many bootstrap samples are drawn from the original sample, each with a random subset of the predictors, and an unpruned regression or classification tree is fitted to each bootstrap sample using the sampled predictors. The complete forest then predicts the response as the average of the predictions of all trees for regression, or as the class with the majority vote for classification (Breiman, 2001).
• Because of its high predictive accuracy, RF was introduced into spatial statistics by applying it to continuous environmental data (Li et al., 2011a, 2011b; Sanabria et al., 2013; Li and Heap, 2014).
• Such applications have significantly improved prediction accuracy, and RF may provide an alternative approach to modelling presence/absence data.

Breiman, L., 2001. Random forests. Machine Learning 45, 5-32.
Li, J., Heap, A.D., Potter, A., Daniell, J., 2011. Application of machine learning methods to spatial interpolation of environmental variables. Environmental Modelling & Software 26, 1647-1659.
Li, J., Heap, A.D., Potter, A., Huang, Z., Daniell, J., 2011. Can we improve the spatial predictions of seabed sediments? A case study of spatial interpolation of mud content across the southwest Australian margin. Continental Shelf Research 31, 1365-1376.
Sanabria, L.A., Qin, X., Li, J., Cechet, R.P., Lucas, C., 2013. Spatial interpolation of McArthur’s forest fire danger index across Australia: observational study. Environmental Modelling & Software 50, 37-50.
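The ensemble procedure described on this slide (bootstrap samples, random predictor subsets, majority vote) can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the authors' implementation: each "tree" is reduced to a single-split decision stump instead of an unpruned tree, predictors are sampled once per tree, and the data, function names, and parameter values are hypothetical.

```python
import random
from collections import Counter


def fit_stump(rows, labels, features):
    """Fit a one-split classifier (a stump) using only the given predictor indices."""
    best = None  # (errors, feature, threshold, left_label, right_label)
    for f in features:
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            if not left or not right:
                continue  # not a real split
            l_lab = Counter(left).most_common(1)[0][0]
            r_lab = Counter(right).most_common(1)[0][0]
            errors = sum(y != l_lab for y in left) + sum(y != r_lab for y in right)
            if best is None or errors < best[0]:
                best = (errors, f, t, l_lab, r_lab)
    if best is None:
        # No valid split exists: always predict the majority class.
        maj = Counter(labels).most_common(1)[0][0]
        return (features[0], float("inf"), maj, maj)
    return best[1:]


def predict_stump(stump, row):
    f, t, l_lab, r_lab = stump
    return l_lab if row[f] <= t else r_lab


def fit_forest(rows, labels, n_trees, n_features, seed=0):
    """Fit n_trees stumps, each on a bootstrap sample and a random predictor subset."""
    rng = random.Random(seed)
    n, p = len(rows), len(rows[0])
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]   # bootstrap: draw n rows with replacement
        feats = rng.sample(range(p), n_features)     # random subset of predictors
        forest.append(fit_stump([rows[i] for i in idx], [labels[i] for i in idx], feats))
    return forest


def predict_forest(forest, row):
    """Classification: return the class with the majority vote across all stumps."""
    votes = Counter(predict_stump(s, row) for s in forest)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two predictors, labels "soft"/"hard" (not the study's data).
ROWS = [[float(i), 0.5 * i] for i in range(10)]
LABELS = ["soft"] * 5 + ["hard"] * 5

forest = fit_forest(ROWS, LABELS, n_trees=25, n_features=1, seed=1)
print(predict_forest(forest, [0.0, 0.0]), predict_forest(forest, [9.0, 4.5]))
```

For regression, the vote in `predict_forest` would be replaced by the mean of the individual predictions.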
Given that RF has so far been applied only to continuous data for spatial prediction, and that seabed hardness is usually inferred, a few questions remain:
• Is seabed hardness predictable?
• Is RF data type-specific?
• How accurate are its predictions for presence/absence data?
• To address the above questions, we used RF to predict the spatial distribution of seabed hardness based on presence and absence data derived from video classification.
[Figures: video classification; study region]
In total, 140 samples of seabed hardness were considered in this study.
• In addition to the presence and absence data derived from video classification, RF used 15 seabed property predictors.
Predictors
1. Bathymetry (bathy)
2. Seabed slope (slope)
3. Topographic relief (relief)
4. Surface area (surface)
5. Topographic position index (tpi)
6. Planar curvature (planar.curv)
7. Profile curvature (profile.curv)
8. Local Moran I of bathymetry (bathy.moran)
9. Backscatter (bs)
10. Homogeneity of backscatter (homogeneity)
11. Variance of backscatter (variance)
12. Local Moran I of backscatter (bs.moran)
13. Prock: the probability of hard substrate
14. Easting
15. Northing
The first eight predictors are bathymetry and its derived variables; the next five are backscatter and its derived variables. All predictors were available for each grid cell at 10 m resolution in the four study areas for predicting seabed hardness.
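As an illustration of how one of the bathymetry-derived predictors can be obtained from a gridded surface, the sketch below computes a topographic position index as the cell value minus the mean of its neighbours in a square window. This is one common definition and an assumption here: the slides do not state the neighbourhood shape, window size, or software the authors used, and the grid values are made up.

```python
def tpi(grid, radius=1):
    """Topographic position index: cell value minus the mean of surrounding cells.

    Uses a square window of the given radius, clipped at the grid edges.
    Assumes the grid has at least two cells, so every cell has a neighbour.
    """
    n_rows, n_cols = len(grid), len(grid[0])
    out = [[0.0] * n_cols for _ in range(n_rows)]
    for r in range(n_rows):
        for c in range(n_cols):
            neighbours = [
                grid[i][j]
                for i in range(max(0, r - radius), min(n_rows, r + radius + 1))
                for j in range(max(0, c - radius), min(n_cols, c + radius + 1))
                if (i, j) != (r, c)  # exclude the cell itself
            ]
            out[r][c] = grid[r][c] - sum(neighbours) / len(neighbours)
    return out


# Hypothetical 3 x 3 bathymetry grid (depths in metres, negative down):
# a local high of -40 m surrounded by a -50 m seabed.
BATHY = [
    [-50.0, -50.0, -50.0],
    [-50.0, -40.0, -50.0],
    [-50.0, -50.0, -50.0],
]
TPI = tpi(BATHY)
print(TPI[1][1])  # positive value: the centre cell sits above its surroundings
```

Positive values mark cells higher than their surroundings (for example crests), negative values mark depressions, and values near zero mark flat or uniformly sloping seabed.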
• The prediction accuracy was assessed using 10-fold cross-validation, repeated 100 times.
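A minimal sketch of repeated 10-fold cross-validation is given below. It assumes the accuracy measure is the percentage of correctly classified samples, which the slides do not state. The labels are synthetic (140 samples to match the sample count, with an arbitrary 84/56 split), and a majority-class classifier stands in for RF. All function names are hypothetical.

```python
import random
from collections import Counter


def kfold_indices(n, k, rng):
    """Shuffle the indices 0..n-1 and deal them into k folds of near-equal size."""
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[i::k] for i in range(k)]


def repeated_cv_accuracy(rows, labels, fit, predict, k=10, repeats=100, seed=0):
    """Return the mean and per-repetition percentage of correctly classified samples."""
    rng = random.Random(seed)
    n = len(rows)
    rates = []
    for _ in range(repeats):
        correct = 0
        for test_idx in kfold_indices(n, k, rng):  # new random partition each repetition
            held_out = set(test_idx)
            train = [i for i in range(n) if i not in held_out]
            model = fit([rows[i] for i in train], [labels[i] for i in train])
            correct += sum(predict(model, rows[i]) == labels[i] for i in test_idx)
        rates.append(100.0 * correct / n)
    return sum(rates) / len(rates), rates


# Placeholder classifier (an assumption): always predicts the majority training class.
def fit_majority(rows, labels):
    return Counter(labels).most_common(1)[0][0]


def predict_majority(model, row):
    return model


# Synthetic data: 140 samples, one dummy predictor, arbitrary class proportions.
CV_ROWS = [[float(i)] for i in range(140)]
CV_LABELS = ["hard"] * 84 + ["soft"] * 56

mean_rate, rates = repeated_cv_accuracy(CV_ROWS, CV_LABELS, fit_majority, predict_majority)
print(round(mean_rate, 1), len(rates))
```

Any classifier with the same `fit`/`predict` shape can be passed in place of the majority-class baseline. Averaging over the repetitions reduces the dependence of the accuracy estimate on any single random partition into folds.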
Results and discussion
> dev.rf1