
JSE

Journal of Systematics and Evolution

doi: 10.1111/jse.12258

Research Article

Machine learning algorithms improve the power of phytolith analysis: A case study of the tribe Oryzeae (Poaceae)

Zhe Cai1,2 and Song Ge1,2*

1 State Key Laboratory of Systematic and Evolutionary Botany, Institute of Botany, Chinese Academy of Sciences, Beijing 100093, China
2 University of Chinese Academy of Sciences, Beijing 100049, China
*Author for correspondence. E-mail: [email protected]. Tel.: +86-10-62836097. Fax: +86-10-62590843.
Received 23 March 2017; Accepted 7 May 2017; Article first published online 18 July 2017

Abstract Phytoliths, as one of the important sources of microfossils, have been widely used in paleobotany-related studies, especially in the grass family (Poaceae), where abundant phytoliths are found. Despite great efforts, several challenges remain when phytoliths are used in various studies, including the accurate description of phytolith morphology and the effective utilization of phytolith traits in taxon identification or discrimination. In this study, we analyzed over 1000 phytolith samples from 18 taxa representing seven main genera in the tribe Oryzeae (subfamily Ehrhartoideae) and five taxa in the subfamilies Bambusoideae and Pooideae. By focusing on Oryzeae, which has been extensively investigated in terms of taxonomy and phylogeny, we were able to evaluate the discrimination power of phytoliths at lower taxonomic levels in grasses. With the help of morphometric analysis and by introducing several machine learning algorithms, we found that 87.7% of the phytolith samples could be classified correctly at the genus level. In spite of slightly different performances, all four machine learning algorithms significantly increased the resolving power of phytolith evidence in taxon identification and discrimination compared with traditional phytolith analysis. Therefore, we propose a pipeline of phytolith analyses based on machine learning algorithms, including data collection, morphometric analysis, model building, and taxon discrimination. The methodology and pipeline presented here should be applicable to various studies across different groups of plants. This study provides new insights into the utilization of phytoliths in evolutionary and ecological studies involving grasses and plants in general.

Key words: machine learning, morphological character, phytolith, Poaceae, taxon discrimination.

1 Introduction

Recent decades have witnessed an extraordinary pace of change in evolutionary biology as our understanding of biology has advanced and many new technologies and analytical approaches have been developed. However, the most direct evidence for interpreting evolutionary history is found in the fossil record, from which we understand the great episodes of extinction and diversification in history and the existence of innumerable creatures, either extinct or extant (Taylor et al., 2009; Futuyma, 2013). Phytoliths are silicified plant tissues in which soluble silica from the groundwater has been deposited. They are an important source of microfossils because they can survive for a long time in sediment after the decay of plants (Piperno, 2006; Ball et al., 2016). The morphology of phytoliths is taxon-specific in many plant groups (Rapp & Mulholland, 1992; Piperno, 2006) and has been increasingly used in studies in numerous disciplines, substantially extending our knowledge about these microfossils (reviewed by Hart, 2016, and Zurro et al., 2016).

July 2017 | Volume 55 | Issue 4 | 377–384

© 2017 Institute of Botany, Chinese Academy of Sciences

Despite great progress, several challenges remain when phytoliths are used in various studies. First is the accurate description of the morphology of phytoliths, which is a prerequisite for the application of phytolith information. Traditionally, the morphological trait recorded for a phytolith was mainly its shape, as described by specific terms (e.g., dumbbell, saddle, and rondel) (Madella et al., 2005). Phytoliths described by the same term are assumed to have similar shapes and to be more likely to originate from related taxa (Mulholland & Rapp, 1992; Piperno & Pearsall, 1998). However, this practice introduced bias in describing the morphological variation of phytoliths, especially when distantly related taxa were involved. Moreover, it is difficult to compare results among studies that use these terms, because different authors might have different understandings and treatments in their own studies. Second, taxon identification or discrimination is another critical step in phytolith studies, and this practice relies heavily on the experience of specialists who have received long-term professional training (Hart, 2016; Zurro et al., 2016). In this sense, artificial biases are likely to be introduced in phytolith analyses, so quantitative discrimination with morphometric approaches has been strongly recommended (Ball et al., 2016; Evett & Cuthrell, 2016); such approaches have indeed been used successfully in many cases (Whang et al., 1998; Zhao et al., 1998; Rovner & Gyulai, 2007). Nevertheless, how to make full use of morphometric traits in phytoliths and how to efficiently analyze the inherent information are still big challenges.

Since the publication of the “Period of Expanding Applications (of Phytoliths)” (Hart, 2016), the crucial task for the utilization of phytoliths in evolutionary studies has been to overcome the limitations and improve the discrimination power of phytolith evidence. Essentially, connecting the morphological description of phytolith samples to taxon identity is a process of supervised learning (Mohri et al., 2012), in which the discrimination abilities or skills are learned from descriptions of phytoliths extracted from extant plants whose identities are known. This process can be implemented automatically with well-developed machine learning algorithms that significantly improve the efficiency of training and eliminate inherent human bias (Jain et al., 2000). Importantly, the accuracy of these algorithms increases with the number of phytolith samples, which in turn promotes the sharing of phytolith morphological data among scholars and studies (Hart, 2016; Zurro et al., 2016).

The grass family (Poaceae) includes many species of great economic and ecological importance. Grasses cover one-fifth of the Earth’s land surface, and their evolutionary history has attracted widespread interest (GPWG, 2001; Soreng et al., 2015). However, there is relatively little early paleontological evidence on the history of herbaceous species such as grasses, and phytoliths, as microfossils, have become an important source of evidence in evolutionary studies of grasses (Crepet & Feldman, 1991; GPWG, 2001).
In the grass family, where phytoliths are abundant in the sediments, phytolith evidence has been widely used in taxon identification at different hierarchical levels (Mulholland & Rapp, 1992; Zhao et al., 1998; Prasad et al., 2005; Strömberg, 2005; Barboni & Bremond, 2009; Piperno et al., 2009) and involved in many research fields, including crop domestication and the origin of agriculture (Pearsall, 1978; Zhao et al., 1998; Harvey & Fuller, 2005; Lu et al., 2009), reconstruction of paleovegetation and paleoclimate (Twiss, 1992; Iriarte, 2006; Ghosh et al., 2008; Gu et al., 2012), and the origin of specific groups of grasses (Prasad et al., 2005, 2011; Strömberg, 2005). However, it is still a challenge to use phytolith evidence at lower taxonomic levels because of the difficulties in morphological description and quantitative discrimination of phytoliths.

In the present study, we collected phytoliths from extant species at different taxonomic levels in the grass family and evaluated the discrimination power of phytoliths at lower taxonomic levels. We particularly focused on the tribe Oryzeae, which has been extensively investigated in terms of taxonomy and phylogeny (Ge et al., 2002; Guo & Ge, 2005; Tang et al., 2010, 2015) and has thus become a good model system for phytolith studies. With the help of morphometric analysis (Adams et al., 2004) and by introducing machine learning algorithms, we obtained high resolving power of phytolith evidence in grasses and propose a pipeline of phytolith analyses. The pipeline, particularly the machine learning algorithms, should be applicable to various studies across different groups of plants.

2 Material and Methods

2.1 Plant sampling and phytolith extraction

We collected leaf samples for phytolith extraction from 23 taxa in the grass family (Table 1), with 0.2 g of dried leaves sampled for each taxon. Eighteen taxa were sampled from seven main genera in the tribe Oryzeae (subfamily Ehrhartoideae). An additional five samples represented the subfamilies Bambusoideae and Pooideae. These three subfamilies belong to the BEP clade of grasses (Wu & Ge, 2012) and serve as a valuable control system because the discrimination power of phytoliths has been widely recognized at the subfamily level in grasses (Piperno & Pearsall, 1998; Strömberg, 2004). The term “phytolith” in the present study refers to the short cell phytolith, which has been proved to be the most important type of phytolith and is widely used in taxon discrimination in grasses (Mulholland & Rapp, 1992; Strömberg, 2004; Piperno, 2006).

Wet oxidation is a routine method of phytolith extraction from extant plants (Piperno, 2006). However, in our preliminary survey with this method, we found that the oxidation of samples was often incomplete, so that the phytoliths remained attached to tissue residues. This led to difficulties in precise morphological screening because of the unclear boundary of the phytoliths. In contrast, under the high temperature and pressure produced by microwave heating, we obtained completely digested tissues and individual phytoliths free of organic matter that could be counted easily. Therefore, we used the microwave digestion method to extract the phytoliths in this study. The detailed protocol is available with the Mars 5 Microwave Digestion System (CEM, Matthews, NC, USA).

2.2 Morphometric analysis

The observation and measurement of phytoliths were conventionally carried out under an optical microscope given its high efficiency and low cost.
However, it is generally difficult to detect the boundary of phytoliths under an optical microscope due to its low depth of focus (Evett & Cuthrell, 2016). Therefore, in addition to the optical microscope, we observed and imaged the morphology of phytoliths using a Hitachi S-4800 Field Emission Scanning Electron Microscope (SEM; Hitachi, Tokyo, Japan) with high resolution and depth of focus. The accelerating voltage and magnification were set to 10 kV and 3000× to 5000×, respectively.

Two different morphometric methods, traditional (Marcus, 1990; Henderson, 2006) and outline-based (Ferson et al., 1985; Andrade et al., 2010), were used to describe the morphological variation of phytoliths. The outline-based method was used at the higher taxonomic level to describe the boundary of phytoliths with elliptic Fourier descriptors (EFDs; Kuhl & Giardina, 1982), as implemented in the software SHAPE version 1.3 (Iwata & Ukai, 2002). Of the numerous EFDs, we chose to use the first nine principal components of the normalized EFDs (Rohlf & Archie, 1984) in our analysis at the subfamily level. The traditional method was used at the lower taxonomic level for analyzing samples from different genera. We obtained 10 linear distance measurements of the lobe and shank of the phytoliths (L1 to L10 in Fig. 1) for 18 taxa representing different genera of Oryzeae using Image-Pro Plus version 6.0 (Media Cybernetics, Rockville, MD, USA). Instead of using the measurements directly, we calculated and used 10 ratios between these measurements (Fig. 1) (Adams et al., 2004) in subsequent analyses to avoid the repeated use of size information. One additional parameter, the sum of the total length and width of a phytolith, was recorded to describe the size of a phytolith (Fig. 1). To reduce the bias arising from unequal numbers of counted samples, 50 phytoliths per taxon were randomly selected for statistical analysis. Together, we used 11 traditional and nine outline-based traits to identify and discriminate the phytolith samples using routine statistical methods including ANOVA, hierarchical clustering, and principal component analysis (PCA).

Table 1 Samples used in this study
Taxon | Subfamily | Accession No. | Origin
Chikusichloa aquatica Koidz. | Ehrhartoideae | 106186 | Japan
Chikusichloa mutica Keng | Ehrhartoideae | DYGX | Guangxi, China
Leersia hexandra Sw. | Ehrhartoideae | 105252 | Philippines
Leersia lenticularis Michx. | Ehrhartoideae | 002-03 | Louisiana, USA
Leersia oryzoides (L.) Sw.† | Ehrhartoideae | 200201-200210 | Guangdong, China
Leersia perrieri (A. Camus) Launert | Ehrhartoideae | 105164 | Madagascar
Leersia sayanuka Ohwi | Ehrhartoideae | 06GQL | Zhuxi, Hubei, China
Leersia tisserantii (A. Chev.) Launert | Ehrhartoideae | 105610 | Cameroon
Leersia tisserantii (A. Chev.) Launert | Ehrhartoideae | 101384 | Guinea
Luziola leiocarpa Lindm. | Ehrhartoideae | 82043 | Argentina
Potamophila parviflora R. Br. | Ehrhartoideae | 85424 | Australia
Zizania aquatica L. | Ehrhartoideae | Wen 8584 | USA
Zizania latifolia (Griseb.) Turcz. ex Stapf. | Ehrhartoideae | GS0202 | Beijing, China
Zizaniopsis villanensis Quarin | Ehrhartoideae | 85425 | Argentina
Oryza sativa L. ssp. indica† | Ehrhartoideae | 2540 | Japan
Oryza sativa L. ssp. japonica | Ehrhartoideae | 55471 | South Korea
Oryza rufipogon Griff. | Ehrhartoideae | 81991 | Myanmar
Oryza nivara S. D. Sharma & Shastry | Ehrhartoideae | 89215 | Cambodia
Phyllostachys edulis (Carrière) J. Houz. | Bambusoideae | 307 | Yunnan, China
Bambusa emeiensis L. S. Chia & H. L. Fung | Bambusoideae | 310 | Yunnan, China
Poa annua L. | Pooideae | 104 | Yunnan, China
Calamagrostis epigeios (L.) Roth | Pooideae | 106 | Neimenggu, China
Bromus inermis Leyss. | Pooideae | 108 | Xinjiang, China
†Samples used for analysis at the subfamily level.

Fig. 1. Illustration of linear measurements of short cell phytoliths and ratio parameters used in the analysis. L1 to L10 are 10 traditional linear measurements of short cell phytoliths. Ratio and size parameters are listed on the right.

2.3 Taxon discrimination with machine learning algorithms

Machine learning algorithms were developed to simulate human thinking and judgment and represent the core of artificial intelligence (Mohri et al., 2012; Jordan & Mitchell, 2015). Some well-known examples are fingerprint/face recognition (Hsu et al., 2002), text categorization (Sebastiani, 2002), driverless cars (Gupta & Merchant, 2016), and AlphaGo (Silver et al., 2016). The learning process using category-known samples in these algorithms is equivalent to the long-term professional training of phytolith experts. We adopted four well-developed and widely used machine learning algorithms and evaluated the power of these algorithms in taxon identification and discrimination at the lower taxonomic level. The four algorithms were decision tree (DT; Breiman et al., 1998), k-nearest neighbors (KNN; Altman, 1992), support vector machine (SVM; Cortes & Vapnik, 1995), and multi-layer perceptron neural networks (MLP; Ripley & Hjort, 1996). Model building for each algorithm consists of both learning and validation processes (Jordan & Mitchell, 2015). Specifically, the morphometric data (with identities known) were randomly divided into the learning set (80%) and the validation set (20%). The prediction error rate was calculated by comparing the real and predicted identities of samples in the validation set. This process was repeated until the error rate converged to a minimum value and the optimal model parameters were found. These model fitting processes were carried out with SPSS Statistics version 17.0, PCP version 2.2 (Buturovic, 2006), and the R packages e1071 and nnet (Venables & Ripley, 2002). The false-positive rate was computed and compared for each algorithm.
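The learning and validation procedure described in Section 2.3 can be illustrated with a minimal sketch. The code below is not the SPSS, PCP, or R workflow used in this study. It is a pure-Python k-nearest neighbors classifier applied to synthetic two-trait data, and the function names (knn_predict, holdout_error, make_toy_samples), the choice k = 3, and the labels genus_A and genus_B are assumptions introduced only for illustration.

```python
import math
import random
from collections import Counter


def knn_predict(train, query, k=3):
    """Majority vote among the k training samples closest to `query`."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def holdout_error(samples, k=3, learn_fraction=0.8, seed=0):
    """Shuffle, split into learning/validation sets, return the validation error rate."""
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * learn_fraction)
    learning, validation = shuffled[:cut], shuffled[cut:]
    wrong = sum(knn_predict(learning, x, k) != y for x, y in validation)
    return wrong / len(validation)


def make_toy_samples(seed=1):
    """Synthetic (trait vector, label) pairs; hypothetical, not data from the study."""
    rng = random.Random(seed)
    centers = {"genus_A": (0.0, 0.0), "genus_B": (5.0, 5.0)}
    return [((rng.gauss(cx, 0.5), rng.gauss(cy, 0.5)), name)
            for name, (cx, cy) in centers.items() for _ in range(50)]


samples = make_toy_samples()
error_rate = holdout_error(samples, k=3)  # 80% learning set, 20% validation set
print(f"validation error rate: {error_rate:.3f}")
```

With real morphometric data, each trait vector would hold the ratio and size parameters of one phytolith, and the split would typically be repeated with different seeds and candidate parameters before a model is chosen.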

3 Results

We imaged 1063 phytolith samples using both the optical microscope and SEM and found that the SEM images were more easily retrieved and better resolved, with clear boundaries and texture. We therefore based our morphometric analyses of phytoliths on the SEM images. Using traditional morphological descriptions of phytoliths, we found that the phytoliths from different subfamilies could be identified easily based on their shape, with the “bilobate” type occurring in Ehrhartoideae, the “rondel” type in Pooideae, and a “bilobate- or saddle-like shape with irregular structure” in Bambusoideae (Piperno & Pearsall, 1998).

We then used the outline-based method to describe the phytolith morphology so that the machine learning method could be based on the constructed contour. As shown in Fig. S1, the shape of a typical phytolith can be modeled with the EFDs (green curves), and the fitting precision increased with the number of EFDs; however, the increase was not significant when more than five EFDs were used. Using PCA of the EFDs, we could also detect the trends of phytolith variation in the samples. For example, variation trends could be detected along the first three principal components (Fig. 2), and a large proportion of the variation (shown by PC1) arose from the width, with shapes ranging from bilobate-like to rondel-like. Using these principal components, the discrimination power of phytoliths at the subfamily level was 83.6% on average, ranging from 81.5% to 86.8% depending on the algorithm (Table 2). This result indicated that the machine learning method achieved a discrimination power above 80% at the subfamily level, where traditional morphological description and discrimination also work efficiently.

Fig. 2. Shape diversity and variation trends along principal components (PC) extracted from elliptic Fourier descriptors in Bambusoideae. The visualization was achieved by calculating the mean and standard deviation (S.D.) of the elliptic Fourier descriptor coefficients (Iwata & Ukai, 2002).

To test the utility of the machine learning algorithms at the lower taxonomic level, we repeated these procedures for different genera in Oryzeae. First, we used the traditional morphometric method (linear measurements) to collect the morphological variation information and undertook summary statistical analyses on the samples from different genera (Table S1). The ANOVA showed that, out of 210 pairwise comparisons between genera, 162 (77%) were significantly different (P < 0.05), indicating that the 11 traits contained effective information for taxon discrimination at the genus level. However, with PCA and the hierarchical clustering method, samples from different genera could not be separated easily, as they were all mixed together. Second, we undertook the discrimination analyses using the four machine learning algorithms and found that the average discrimination power was 80.8% (from 75.2% to 87.7%), comparable to that at the subfamily level (Table 2, detailed results listed in Table S2). In addition, we found that an algorithm with high discrimination power tends to have a small variation of accuracy rate among genera. For example, the SVM algorithm, which achieved the highest discrimination power (87.7%), generated the smallest variation of accuracy rate among genera (S.D. 7.0%), whereas the DT algorithm, with the lowest accuracy rate (75.2%), showed a much larger variation of accuracy rate (S.D. 15.9%) (Table 2). These observations indicated that, although different machine learning algorithms might perform differently, the overall discrimination power is very high.

Table 2 Proportion of phytoliths that were classified correctly by using machine learning algorithms at the subfamily and genus levels
Taxon | DT | KNN | SVM | MLP
Subfamily level
Bambusoideae | 96.2% | 70.9% | 83.5% | 81.0%
Ehrhartoideae | 83.7% | 88.4% | 86.0% | 83.7%
Pooideae | 80.5% | 85.4% | 81.7% | 81.7%
Average across subfamily | 86.8% | 81.5% | 83.8% | 82.1%
S.D. of accuracy rate | 8.3% | 9.3% | 2.2% | 1.4%
Genus level
Chikusichloa | 88.7% | 88.7% | 98.1% | 86.8%
Leersia | 88.2% | 62.7% | 76.5% | 68.6%
Luziola | 90.5% | 88.1% | 92.9% | 90.5%
Potamophila | 78.4% | 86.5% | 83.8% | 81.1%
Zizania | 71.2% | 67.3% | 90.4% | 61.5%
Zizaniopsis | 48.9% | 91.1% | 86.7% | 82.2%
Oryza | 60.4% | 83.3% | 85.4% | 83.3%
Average across genus | 75.2% | 81.1% | 87.7% | 79.2%
S.D. of accuracy rate | 15.9% | 11.3% | 7.0% | 10.3%
DT, decision tree; KNN, k-nearest neighbors; MLP, multi-layer perceptron neural networks; SVM, support vector machine.

Note that the discrimination power of a specific algorithm differed among genera. In Luziola, the accuracy rate exceeded 90% in all algorithms except for KNN (88.1%). High power was also achieved in Chikusichloa and Potamophila, with average accuracy rates above 80%. In comparison, on average less than 75% of the phytolith samples collected from Leersia and Zizania were assigned to their own taxa. Even for the same genus, different algorithms performed quite differently (Table 2). For example, the accuracy rate of discrimination in Zizaniopsis ranged from 48.9% (DT algorithm) to 91.1% (KNN algorithm). Of the four algorithms, SVM had the best performance at the genus level, with the highest average accuracy rate (87.7%) and the smallest variation among genera.

Finally, we calculated the false-positive rate, an important criterion for discrimination performance. For all four algorithms, average false-positive rates were less than 0.05; the lowest value was 0.021 from the SVM algorithm (Table 3). Because a low false-positive rate corresponds to high specificity, these results indicated that the false-positive rate was well controlled and that both high sensitivity and high specificity were achieved in the discrimination.
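As an illustration of how per-genus accuracy rates and false-positive rates of this kind can be derived from validation results, the sketch below assumes a one-vs-rest definition: for a given genus, the accuracy rate is the fraction of its samples assigned to it, and the false-positive rate is the fraction of samples from all other genera that were wrongly assigned to it. The study does not spell out its exact formula, so this definition, the function name per_class_rates, and the toy labels are assumptions rather than a reproduction of the analysis.

```python
from collections import Counter


def per_class_rates(true_labels, predicted_labels):
    """Return {label: (accuracy_rate, false_positive_rate)}, one-vs-rest."""
    labels = sorted(set(true_labels) | set(predicted_labels))
    pairs = Counter(zip(true_labels, predicted_labels))
    total = len(true_labels)
    rates = {}
    for label in labels:
        true_pos = pairs[(label, label)]
        actual_pos = sum(n for (t, _), n in pairs.items() if t == label)
        predicted_pos = sum(n for (_, p), n in pairs.items() if p == label)
        false_pos = predicted_pos - true_pos
        actual_neg = total - actual_pos
        accuracy = true_pos / actual_pos if actual_pos else float("nan")
        fpr = false_pos / actual_neg if actual_neg else float("nan")
        rates[label] = (accuracy, fpr)
    return rates


# Made-up validation outcome for ten samples (not data from the study).
truth = ["Oryza"] * 4 + ["Leersia"] * 4 + ["Zizania"] * 2
guess = ["Oryza", "Oryza", "Oryza", "Leersia",
         "Leersia", "Leersia", "Oryza", "Zizania",
         "Zizania", "Zizania"]
rates = per_class_rates(truth, guess)
for genus, (accuracy, fpr) in rates.items():
    print(f"{genus}: accuracy {accuracy:.3f}, false-positive rate {fpr:.3f}")
```

Under this definition the false-positive rate equals one minus the specificity, which is why a low false-positive rate and a high specificity are two statements of the same result.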

4 Discussion

Phytoliths, as a kind of microfossil, have greatly advanced our knowledge of plant evolution and related fields. Short cell phytoliths, in particular, are one of the most important lines of evidence in grasses and have been widely used at higher taxonomic levels (e.g., for subfamilies) (Strömberg, 2004; Piperno, 2006; Ball et al., 2016; Zurro et al., 2016). However, identification and discrimination have often been difficult at lower taxonomic levels (e.g., for genera and species), either because the shape information defined by descriptive terms and routinely used at higher taxonomic levels might be convergent in origin, and thus not informative, or because quantitative descriptions and analyses of phytoliths have been lacking. In the present study, by collecting multiple morphological traits including linear distances and the pairwise ratios between them, we were able to precisely describe the morphological features, which are taxon-specific and thus informative for taxon identification and discrimination. Unlike the morphological terms used previously, these morphometric data clearly showed the morphological variation of phytoliths within taxa when analyzed with appropriate statistical and visualization tools. Therefore, we successfully detected distinct phytoliths from different genera in Oryzeae, despite these phytoliths belonging to the same type as described by morphological terms. Importantly, the artificial bias derived from different researchers with various levels of training might be excluded with the introduction of image processing techniques. The improved morphometric method not only helps to capture the morphological variation of phytoliths, but also provides an effective way to quantitatively and automatically identify and eventually classify the phytoliths according to their taxa.

In addition to the informative description of phytoliths, taxon identification and discrimination of phytoliths based on their morphology are critical in phytolith-related investigations, but include steps where artificial biases might arise because these practices require much experience and professional training. Here we introduced several well-developed machine learning algorithms into phytolith identification and discrimination for the first time. Using the Oryzeae species as the working system, we obtained a high power of taxon discrimination for phytoliths at the genus level, with up to 87.7% of the phytoliths being successfully classified into the taxa from which they were derived. Such power is comparable to that at the subfamily level (Table 2), where the reliability of phytolith analysis was very high when conventional approaches were used. As reported by studies in a variety of fields (Jordan & Mitchell, 2015), these machine learning algorithms reduced the dependence on researchers’ experience and skills and avoided potential artifacts as well. It is worth mentioning that the four algorithms we evaluated performed slightly differently on the same taxa and had different power for different taxa (Table 2). The SVM algorithm achieved the best overall performance in terms of accuracy rate and false-positive rate. The KNN algorithm performed the best in the genera Potamophila and Zizaniopsis. Therefore, attempts with multiple algorithms are strongly recommended in future studies, although SVM and KNN performed better in our case.

Table 3 False-positive rate of different algorithms for genera of Oryzeae
Algorithm | Chikusichloa | Leersia | Luziola | Potamophila | Zizania | Zizaniopsis | Oryza | Average across genera
DT | 0.0145 | 0.1552 | 0.0140 | 0.0241 | 0.0580 | 0.0035 | 0.0214 | 0.0415
KNN | 0.0218 | 0.0325 | 0.0245 | 0.0275 | 0.0362 | 0.0283 | 0.0571 | 0.0326
SVM | 0.0145 | 0.0289 | 0.0000 | 0.0137 | 0.0507 | 0.0177 | 0.0179 | 0.0205
MLP | 0.0218 | 0.0397 | 0.0245 | 0.0172 | 0.0543 | 0.0495 | 0.0429 | 0.0357
Average across algorithms | 0.0182 | 0.0641 | 0.0157 | 0.0206 | 0.0498 | 0.0247 | 0.0348 |
DT, decision tree; KNN, k-nearest neighbors; MLP, multi-layer perceptron neural networks; SVM, support vector machine.

Standardization and automatization in phytolith studies have been proposed by many authors (Ball et al., 2016; Evett & Cuthrell, 2016; Hart, 2016; Zurro et al., 2016). Based on the high power of morphometric analysis and machine learning algorithms, we propose a pipeline for phytolith analysis at lower taxonomic levels (for genera and even species or subspecies) (Fig. 3). The pipeline consists of three steps: morphometrics, model building, and discrimination. In the first step, plant materials are sampled uniformly from extant taxa according to the phylogeny of the target groups (to avoid any group being over-represented). Then phytoliths are extracted through microwave digestion and imaged under SEM, followed by preliminary processing of the images (e.g., increased contrast) to make the boundary clear. The morphological parameters can be selected from traditional linear measurements or outline data, as mentioned above. To retrieve additional information, we suggest using more complex parameters (e.g., textures and curvature) to increase the discrimination power. In the model building step, several major algorithms are applied and compared. Cross-validation must be used in parameter selection to avoid over-fitting of the model/classifier to the dataset. Once the classifier is constructed, the phytolith samples under study can be measured and discriminated with the optimal model and parameters in the final discrimination step. Moreover, data resulting from this pipeline can be effectively shared and compared among different studies, similar to bioinformatics databases, which improves the power of phytolith evidence in subsequent studies. In principle, the pipeline applies to studies at higher taxonomic levels as well, although the characters or traits used might be different.

Fig. 3. Proposed pipeline for taxon discrimination of phytoliths using machine learning algorithms. Recommended methods and tools (e.g., support vector machine (SVM) algorithm and SHAPE or Image-Pro Plus software) are also listed below each step. SEM, scanning electron microscopy.

In a recent review, Jordan & Mitchell (2015) presented current progress in machine learning and its impacts on numerous theoretical and application areas (Hsu et al., 2002; Sebastiani, 2002; Gupta & Merchant, 2016; Silver et al., 2016). Our study illustrates the advantages of machine learning algorithms in phytolith analyses in the grass family. In principle, the methodology and pipeline presented here should apply to investigations of any group of plants that use phytoliths as a line of evidence.
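The role of cross-validation in the model building step of the pipeline can be sketched as follows. This is a hedged, pure-Python illustration rather than the procedure run in this study: it selects the neighborhood size k of a k-nearest neighbors classifier by five-fold cross-validation on synthetic data. The function names (knn_predict, cross_validated_accuracy, select_k), the candidate values of k, the number of folds, and the three toy genus labels are all assumptions made for the example.

```python
import math
import random
from collections import Counter


def knn_predict(train, query, k):
    """Majority vote among the k training samples closest to `query`."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def cross_validated_accuracy(samples, k, folds=5, seed=0):
    """Mean accuracy of k-NN over `folds` disjoint validation folds."""
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    scores = []
    for i in range(folds):
        # Every sample is used for validation exactly once.
        validation = shuffled[i::folds]
        learning = [s for j, s in enumerate(shuffled) if j % folds != i]
        hits = sum(knn_predict(learning, x, k) == y for x, y in validation)
        scores.append(hits / len(validation))
    return sum(scores) / folds


def select_k(samples, candidates=(1, 3, 5, 7)):
    """Pick the candidate k with the best cross-validated accuracy."""
    scores = {k: cross_validated_accuracy(samples, k) for k in candidates}
    best = max(scores, key=scores.get)
    return best, scores


# Three partly overlapping synthetic clusters (hypothetical, not study data).
rng = random.Random(42)
centers = {"genus_A": (0.0, 0.0), "genus_B": (2.0, 0.0), "genus_C": (1.0, 2.0)}
toy = [((rng.gauss(cx, 0.8), rng.gauss(cy, 0.8)), name)
       for name, (cx, cy) in centers.items() for _ in range(40)]

best_k, scores = select_k(toy)
print("cross-validated accuracy per k:", scores)
print("selected k:", best_k)
```

Because every candidate parameter is scored on samples that were held out during learning, the selected value is less likely to reflect over-fitting to one particular split. The same scheme applies to the tuning parameters of DT, SVM, or MLP models.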

Acknowledgements

We thank Yu-Fei Wang, Wen-Li Chen, and members of the Ge laboratory for their valuable discussions and suggestions. We also thank the International Rice Research Institute (Los Baños, Philippines) for providing seed samples of some Oryzeae species. This work was supported by the National Natural Science Foundation of China (Grant Nos. 91231201 and 30990240) and the CAS/SAFEA International Partnership Program for Creative Research Teams.

Conflict of Interest

The authors declare no conflict of interest.

References

Adams DC, Rohlf FJ, Slice DE. 2004. Geometric morphometrics: Ten years of progress following the ‘revolution’. Italian Journal of Zoology 71: 5–16.

Altman NS. 1992. An introduction to kernel and nearest-neighbor nonparametric regression. The American Statistician 46: 175–185.

Andrade IM, Mayo SJ, Kirkup D, Van den Berg C. 2010. Elliptic Fourier Analysis of leaf outline shape in forest fragment populations of Anthurium sinuatum and A. pentaphyllum (Araceae) from Northeast Brazil. Kew Bulletin 65: 3–20.

Ball TB, Davis A, Evett RR, Ladwig JL, Tromp M, Out WA, Portillo M. 2016. Morphometric analysis of phytoliths: Recommendations towards standardization from the International Committee for Phytolith Morphometrics. Journal of Archaeological Science 68: 106–111.

Barboni D, Bremond L. 2009. Phytoliths of East African grasses: An assessment of their environmental and taxonomic significance based on floristic data. Review of Palaeobotany and Palynology 158: 29–41.

Breiman L, Friedman JH, Olshen RA, Stone CJ. 1998. Classification and regression trees. Wadsworth: Chapman & Hall/CRC.

Buturovic LJ. 2006. PCP: A program for supervised classification of gene expression profiles. Bioinformatics 22: 245–247.

Cortes C, Vapnik V. 1995. Support-vector networks. Machine Learning 20: 273–297.

Crepet WL, Feldman GD. 1991. The earliest remains of grasses in the fossil record. American Journal of Botany 78: 1010–1014.

Evett RR, Cuthrell RQ. 2016. A conceptual framework for a computer-assisted, morphometric-based phytolith analysis and classification system. Journal of Archaeological Science 68: 70–78.


Ferson S, Rohlf FJ, Koehn RK. 1985. Measuring shape variation of two-dimensional outlines. Systematic Biology 34: 59–68.

Pearsall DM. 1978. Phytolith analysis of archeological soils: Evidence for maize cultivation in formative Ecuador. Science 199: 177–178.

Futuyma DJ. 2013. Evolution. Sunderland: Sinauer Associates.

Piperno DR. 2006. Phytoliths: A comprehensive guide for archaeologists and paleoecologists. Lanham: AltaMira Press.

Ge S, Li A, Lu BR, Zhang SZ, Hong DY. 2002. A phylogeny of the rice tribe Oryzeae (Poaceae) based on matK sequence data. American Journal of Botany 89: 1967–1972.

Ghosh R, Gupta S, Bera S, Jiang HE, Li X, Li CS. 2008. Ovi-caprid dung as an indicator of paleovegetation and paleoclimate in northwestern China. Quaternary Research 70: 149–157.

Piperno DR, Pearsall DM. 1998. The silica bodies of tropical American grasses: Morphology, taxonomy, and implications for grass systematics and fossil phytolith identification. Washington, D.C.: Smithsonian Institution Press.

GPWG (Grass Phylogeny Working Group). 2001. Phylogeny and subfamilial classification of the grasses (Poaceae). Annals of the Missouri Botanical Garden 88: 373–457.

Piperno DR, Ranere AJ, Holst I, Iriarte J, Dickau R. 2009. Starch grain and phytolith evidence for early ninth millennium B.P. maize from the Central Balsas River Valley, Mexico. Proceedings of the National Academy of Sciences USA 106: 5019–5024.

Gu Y, Wang H, Huang X, Peng H, Huang J. 2012. Phytolith records of the climate change since the past 15000 years in the middle reach of the Yangtze River in China. Frontiers of Earth Science 6: 10–17.

Prasad V, Strömberg CAE, Alimohammadian H, Sahni A. 2005. Dinosaur coprolites and the early evolution of grasses and grazers. Science 310: 1177–1180.

Guo YL, Ge S. 2005. Molecular phylogeny of Oryzeae (Poaceae) based on DNA sequences from chloroplast, mitochondrial, and nuclear genomes. American Journal of Botany 92: 1548–1558.

Prasad V, Strömberg CAE, Leaché AD, Samant B, Patnaik R, Tang L, Mohabey DM, Ge S, Sahni A. 2011. Late Cretaceous origin of the rice tribe provides evidence for early diversification in Poaceae. Nature Communications 2: 480.

Gupta A, Merchant PS. 2016. Automated lane detection by k-means clustering: A machine learning approach. Electronic Imaging 2016: 1–6.

Rapp G, Mulholland SC. 1992. Phytolith systematics: Emerging issues. New York: Plenum Press.

Hart TC. 2016. Issues and directions in phytolith analysis. Journal of Archaeological Science 68: 24–31.

Ripley BD, Hjort NL. 1996. Pattern recognition and neural networks. Cambridge: Cambridge University Press.

Harvey EL, Fuller DQ. 2005. Investigating crop processing using phytolith analysis: The example of rice and millets. Journal of Archaeological Science 32: 739–752.

Rohlf FJ, Archie JW. 1984. A comparison of Fourier methods for the description of wing shape in mosquitoes (Diptera: Culicidae). Systematic Zoology 33: 302–317.

Henderson A. 2006. Traditional morphometrics in plant systematics and its role in palm systematics. Botanical Journal of the Linnean Society 151: 103–111.

Rovner I, Gyulai F. 2007. Computer-assisted morphometry: A new method for assessing and distinguishing morphological variation in wild and domestic seed populations. Economic Botany 61: 154–172.

Hsu R-L, Abdel-Mottaleb M, Jain AK. 2002. Face detection in color images. IEEE Transactions on Pattern Analysis and Machine Intelligence 24: 696–706.

Sebastiani F. 2002. Machine learning in automated text categorization. ACM Computing Surveys (CSUR) 34: 1–47.

Iriarte J. 2006. Vegetation and climate change since 14,810 ¹⁴C yr B.P. in southeastern Uruguay and implications for the rise of early Formative societies. Quaternary Research 65: 20–32.

Iwata H, Ukai Y. 2002. SHAPE: A computer program package for quantitative evaluation of biological shapes based on elliptic Fourier descriptors. Journal of Heredity 93: 384–385.

Jain AK, Duin RPW, Jianchang M. 2000. Statistical pattern recognition: A review. IEEE Transactions on Pattern Analysis and Machine Intelligence 22: 4–37.

Jordan MI, Mitchell TM. 2015. Machine learning: Trends, perspectives, and prospects. Science 349: 255–260.

Kuhl FP, Giardina CR. 1982. Elliptic Fourier features of a closed contour. Computer Graphics and Image Processing 18: 236–258.

Lu H, Zhang J, Liu KB, Wu N, Li Y, Zhou K, Ye M, Zhang T, Zhang H, Yang X, Shen L, Xu D, Li Q. 2009. Earliest domestication of common millet (Panicum miliaceum) in East Asia extended to 10,000 years ago. Proceedings of the National Academy of Sciences USA 106: 7367–7372.

Madella M, Alexandre A, Ball T. 2005. International code for phytolith nomenclature 1.0. Annals of Botany 96: 253–260.

Marcus L. 1990. Traditional morphometrics. In: Rohlf J, Bookstein F eds. Proceedings of the Michigan Morphometrics Workshop. Special Publication no. 2. Ann Arbor: University of Michigan Museum of Zoology.

Silver D, Huang A, Maddison CJ, Guez A, Sifre L, van den Driessche G, Schrittwieser J, Antonoglou I, Panneershelvam V, Lanctot M, Dieleman S, Grewe D, Nham J, Kalchbrenner N, Sutskever I, Lillicrap T, Leach M, Kavukcuoglu K, Graepel T, Hassabis D. 2016. Mastering the game of Go with deep neural networks and tree search. Nature 529: 484–489.

Soreng RJ, Peterson PM, Romaschenko K, Davidse G, Zuloaga FO, Judziewicz EJ, Filgueiras TS, Davis JI, Morrone O. 2015. A worldwide phylogenetic classification of the Poaceae (Gramineae). Journal of Systematics and Evolution 53: 117–137.

Strömberg CAE. 2004. Using phytolith assemblages to reconstruct the origin and spread of grass-dominated habitats in the great plains of North America during the late Eocene to early Miocene. Palaeogeography, Palaeoclimatology, Palaeoecology 207: 239–275.

Strömberg CAE. 2005. Decoupled taxonomic radiation and ecological expansion of open-habitat grasses in the Cenozoic of North America. Proceedings of the National Academy of Sciences USA 102: 11980–11984.

Tang L, Zou XH, Achoundong G, Potgieter C, Second G, Zhang DY, Ge S. 2010. Phylogeny and biogeography of the rice tribe (Oryzeae): Evidence from combined analysis of 20 chloroplast fragments. Molecular Phylogenetics and Evolution 54: 266–277.

Tang L, Zou XH, Zhang LB, Ge S. 2015. Multilocus species tree analyses resolve the ancient radiation of the subtribe Zizaniinae (Poaceae). Molecular Phylogenetics and Evolution 84: 232–239.

Mohri M, Rostamizadeh A, Talwalkar A. 2012. Foundations of machine learning. Cambridge: MIT Press.

Taylor EL, Taylor TN, Krings M. 2009. Paleobotany: The biology and evolution of fossil plants. Salt Lake City: Academic Press.

Mulholland SC, Rapp G. 1992. A morphological classification of grass silica-bodies. In: Rapp G, Mulholland SC eds. Phytolith systematics: Emerging issues. Boston: Springer US. pp. 65–89.

Twiss P. 1992. Predicted world distribution of C3 and C4 grass phytoliths. In: Rapp G, Mulholland SC eds. Phytolith systematics: Emerging issues. New York: Plenum Press. pp. 113–128.

J. Syst. Evol. 55 (4): 377–384, 2017

Venables WN, Ripley BD. 2002. Modern applied statistics with S. New York: Springer.

Whang SS, Kim K, Hess WM. 1998. Variation of silica bodies in leaf epidermal long cells within and among seventeen species of Oryza (Poaceae). American Journal of Botany 85: 461–466.

Wu ZQ, Ge S. 2012. The phylogeny of the BEP clade in grasses revisited: Evidence from the whole-genome sequences of chloroplasts. Molecular Phylogenetics and Evolution 62: 573–578.

Zhao Z, Pearsall DM, Benfer RA, Piperno DR. 1998. Distinguishing rice (Oryza sativa Poaceae) from wild Oryza species through phytolith analysis, II: Finalized method. Economic Botany 52: 134–145.

Zurro D, García-Granero JJ, Lancelotti C, Madella M. 2016. Directions in current and future phytolith research. Journal of Archaeological Science 68: 112–117.

Supplementary Material

The following supplementary material is available online for this article at http://onlinelibrary.wiley.com/doi/10.1111/jse.12258/suppinfo

Fig. S1. Reconstructed contours of a typical phytolith using different numbers of elliptic Fourier descriptors (EFDs), from 1 to 12, ordered from top-left to bottom-right. The green curve is the outline fitted with the corresponding number of EFDs. The red curve is the outline fitted with one EFD and serves as a reference.

Table S1. Summary statistics of linear measurements of phytoliths for genera of Oryzeae.

Table S2. Detailed discrimination results in Oryzeae using four machine learning algorithms.
