Gene 592 (2016) 316–324

Contents lists available at ScienceDirect

Gene journal homepage: www.elsevier.com/locate/gene

Methodological paper

Identification of species based on DNA barcode using k-mer feature vector and Random forest classifier

Prabina Kumar Meher a, Tanmaya Kumar Sahu b, A.R. Rao b,⁎

a Division of Statistical Genetics, ICAR-Indian Agricultural Statistics Research Institute, New Delhi 110012, India
b Centre for Agricultural Bioinformatics, ICAR-Indian Agricultural Statistics Research Institute, New Delhi 110012, India

Article info

Article history: Received 2 May 2016; Received in revised form 2 July 2016; Accepted 4 July 2016; Available online 5 July 2016

Keywords: Oligomer frequency; Random forest; DNA barcode; SPIDBAR; BOLD systems

Abstract

DNA barcoding is a molecular diagnostic method that allows automated and accurate identification of species based on a short and standardized fragment of DNA. To this end, an attempt has been made in this study to develop a computational approach for identifying a species by comparing its barcode with the barcode sequences of known species present in the reference library. Each barcode sequence was first mapped onto a numeric feature vector based on k-mer frequencies, and the Random forest methodology was then applied to the transformed dataset for species identification. The proposed approach outperformed similarity-based, tree-based and diagnostic-based approaches and was found comparable with existing supervised learning based approaches in terms of species identification success rate when compared using real and simulated datasets. Based on the proposed approach, an online web interface, SPIDBAR, has also been developed and made freely available at http://cabgrid.res.in:8080/spidbar/ for species identification by taxonomists. © 2016 Elsevier B.V. All rights reserved.

1. Introduction

DNA barcoding is an exciting tool in the area of taxonomic research (www.dnabarcodes.org). A DNA barcode (Hebert et al., 2003) is a very short fragment of DNA sequence that has been successfully used to identify the species to which a plant, animal or fungus belongs. In animals, a fragment of ~650 base pairs (bp) near the 5′-terminus of the cytochrome c oxidase I (COI) gene (Hebert et al., 2003) has been accepted as the DNA barcode. In fungi, however, the internal transcribed spacer (ITS) (Shaw et al., 2007) of the nuclear ribosomal sequence has been found to be more appropriate as a barcode than COI (Seifert, 2009). In plants, barcoding is more difficult because COI has been found to be ineffective owing to the slow rate of evolution of the mitochondrial genome (Kress and Erickson, 2007). The Consortium for the Barcode of Life (CBOL) has recommended that the chloroplast genes rbcL (Kress et al., 2005) and matK (Chase et al., 2007) be used as plant barcodes (Hollingsworth et al., 2009; Liu et al., 2011). Moreover, the species identification success rate is higher in animals than in plants (Dinca et al., 2011).

Abbreviations: NJ, Neighbor Joining; NN, Nearest Neighbor; BLAST, Basic Local Alignment Search Tool; PAR, Parsimony; RF, Random Forest; SVM, Support Vector Machine; CBOL, Consortium for the Barcode Of Life; COI, Cytochrome c Oxidase I; ITS, Internal Transcribed Spacer; BOLD, Barcode of Life Data.
⁎ Corresponding author. E-mail addresses: [email protected] (P.K. Meher), [email protected] (T.K. Sahu), [email protected] (A.R. Rao).

http://dx.doi.org/10.1016/j.gene.2016.07.010 0378-1119/© 2016 Elsevier B.V. All rights reserved.

DNA barcoding can be helpful in many real-life applications. It can be used to identify agricultural pests at any stage of life, which makes them easier to control. Natural resource managers can use barcodes to monitor the illegal trade of products processed from rare species. Therefore, the CBOL is collaborating with government agencies, NGOs and researchers on the collection of barcode specimens as well as on the development of new methods to analyze the barcodes, among which statistical techniques for assigning an unknown specimen to a known species are important (Bertolazzi et al., 2009). Moreover, hundreds of thousands of reference barcodes for tens of thousands of species have already been generated in earlier barcode projects, so the barcode of an unknown specimen can be compared with the reference barcodes to identify the matching species. For this task, several computational techniques have been proposed in earlier studies (Liu et al., 2011; Austerlitz et al., 2009; Van Velzen et al., 2012; Weitschek et al., 2013; Weitschek et al., 2014a), and each of them has its own advantages and disadvantages. For instance, similarity- and tree-based methods depend upon sequence alignment. Diagnostic-based methods do not rely upon sequence alignment and achieve a higher species identification success rate than similarity- and tree-based approaches, but their accuracies are still lower than those of machine learning based approaches. Although these approaches have achieved >90% species identification success rate, there is still a need for further improvement. In this study, an attempt has been made to develop a computational approach for identifying the matching species by comparing the DNA barcode of an unknown specimen with the reference barcode library. The developed approach involves two steps: first, each barcode

P.K. Meher et al. / Gene 592 (2016) 316–324

sequence was transformed into a numeric feature vector based on the composition of k-mer frequencies, and second, the Random forest (RF) (Breiman, 2001) machine learning classifier was applied to the numeric feature vectors to assign the unknown specimens (barcode sequences) to pre-defined species. The species identification success rate of the developed approach was found comparable with that of the existing approaches when compared using empirical and simulated datasets.

2. Materials and methods

2.1. Barcode sequence dataset

Two empirical datasets (ED-I, ED-II) and one simulated dataset (SD-I) were used to evaluate the performance of the developed approach as well as to perform the comparative analysis. These datasets are available at http://dmb.iasi.cnr.it/blog.php and have also been used in earlier studies (Van Velzen et al., 2012; Weitschek et al., 2014a). ED-I consists of barcode sequences of several species belonging to trees, Bats, Fishes and Birds. The barcode sequences present in ED-II (Cypraeidae, Drosophila, Inga) are more diverged, as they come from three different phyla, i.e., Inga from Plantae, Cypraeidae from Mollusca and Drosophila from Arthropoda. In addition, the barcode sequences in ED-II come from three different genomic regions, i.e., COI, trnTD and ITS. SD-I comprises three categories of simulated barcode sequences corresponding to three effective population sizes of 1000, 10,000 and 50,000 individuals. Each category contains 100 sets, and each set comprises 1000 sequences belonging to 50 species (20 individuals per species). A summary of the datasets is provided in Supplementary Table S1.

2.2. Feature extraction

Machine learning classifiers take numerical inputs (Zhang et al., 2006), but barcode sequences are available as nucleotide strings (i.e., over A, T, G, C).
Therefore, the barcode sequences need to be mapped onto numeric feature vectors before being used as input to such classifiers. In this study, frequencies of oligo-nucleotide strings of different scales k were used to map the barcode sequences onto numeric vectors. The procedure is as follows. For a given barcode sequence of length L, the number of oligo-nucleotide strings of R consecutive bases α1, α2, …, αR obtained by sliding through the sequence one nucleotide position at a time is (L − R + 1), where αi ∈ {A, T, G, C}. Let n(α1, α2, …, αR) be the number of times the string α1, α2, …, αR appears in the barcode sequence. Then the probability of the string α1, α2, …, αR appearing in the sequence can be computed as n(α1, α2, …, αR)/(L − R + 1). For instance, in the DNA sequence ‘TGAGGTTTGTTTACGGTGAT’, p(A) = 3/20, p(TT) = 4/(20 − 2 + 1) and p(TTT) = 2/(20 − 3 + 1). There are $4^k$ features possible for a single scale k (i.e., k = 1 or 2 or 3, etc.) and $\sum_{k = a, b, c, \ldots} 4^k$ features possible for multiple scales (i.e., k = (1, 2), (1, 3), (1, 3, 4), etc.). Similar types of features have also been used in other areas of genomic research such as clustering of biological sequences (Chu et al., 2009), splice site prediction (Li et al., 2012), classification of bacterial genomes (Weitschek et al., 2014b) and classification of selectively constrained DNA elements (Polychronopoulos et al., 2014).

2.3. Random forest classifier

For classification, the RF supervised learning method (Breiman, 2001) was adopted because (i) RF is an ensemble of tree-based classifiers that performs better than an individual base classifier, and (ii) RF is non-parametric, robust to noise, does not over-fit and is capable of handling large datasets (Breiman, 2001). RF has been frequently used in other areas of bioinformatics and computational biology as well. A brief description of the working principle of RF is as follows. For a given dataset with N observations and p features (explanatory variables), RF consists of an ensemble of several classification trees (Breiman et al., 1984), where each tree is grown on a bootstrap sample of the original dataset. The tree construction procedure in RF differs from that of the classification and regression tree (CART) (Breiman et al., 1984) with respect to the splitting of nodes: if there are p input variables, a number m (≪ p) is specified such that at each node m variables are selected at random out of the p, and the best split on these m variables (decided on the basis of the reduction in impurity) is used to split the node. The value of m is kept constant while the forest is grown, and each tree is grown to the maximum extent possible without pruning. To classify a new query (test) sequence, which is an input vector with p features, the input vector is dropped down each of the trees in the forest, and each tree (classifier) gives a classification label (vote) to the query instance. Finally, the forest assigns the query instance to the class (label) with the most votes. For more details about RF, one can refer to Breiman (2001).
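The k-mer frequency mapping of Section 2.2 can be illustrated with a minimal Python sketch. This is not the authors' R code; the function names `kmer_frequencies` and `feature_vector`, the nucleotide ordering, and the handling of windows containing symbols other than A, T, G, C (they are simply not counted) are assumptions made for illustration.

```python
from itertools import product

ALPHABET = "ATGC"  # nucleotide order is an arbitrary choice for this sketch


def kmer_frequencies(seq, k):
    """Return {k-mer: count / (L - k + 1)} over all 4**k possible k-mers."""
    windows = len(seq) - k + 1
    if windows <= 0:
        raise ValueError("sequence shorter than k")
    counts = {"".join(p): 0 for p in product(ALPHABET, repeat=k)}
    for i in range(windows):  # slide one nucleotide position at a time
        kmer = seq[i:i + k]
        if kmer in counts:  # assumption: windows with other symbols are skipped
            counts[kmer] += 1
    return {kmer: n / windows for kmer, n in counts.items()}


def feature_vector(seq, scales=(1, 4)):
    """Concatenate the frequency vectors of every scale k in `scales`."""
    vector = []
    for k in scales:
        freqs = kmer_frequencies(seq.upper(), k)
        vector.extend(freqs[kmer] for kmer in sorted(freqs))
    return vector


seq = "TGAGGTTTGTTTACGGTGAT"              # example sequence from Section 2.2
print(kmer_frequencies(seq, 1)["A"])      # 3/20
print(kmer_frequencies(seq, 2)["TT"])     # 4/19
print(kmer_frequencies(seq, 3)["TTT"])    # 2/18
print(len(feature_vector(seq)))           # 4**1 + 4**4 = 260
```

Because every possible k-mer gets a slot, the vector length depends only on the chosen scales and not on the length of the barcode.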

2.4. Parameter configuration in Random forest

Two parameters, namely mtry (the number of variables chosen at random at each node of the classification tree) and ntree (the number of classification trees grown in the forest), were optimized to obtain higher accuracy in RF. Fifty percent of the sequences of ED-I were used to optimize these parameters. The parameter values at which the misclassification error rate was minimum and stable were considered optimum. The same optimum parameters were also used for assessing the performance of RF on ED-II and SD-I.

2.5. Implementation of Random forest

To implement the RF classifier, the “randomForest” package of the R software (Liaw and Wiener, 2002) was installed on a 32-bit Windows XP system with 2 GB RAM. The function randomForest was used to train the model with the arguments mtry = “optimum mtry” and ntree = “optimum ntree”, where “optimum mtry” and “optimum ntree” were obtained as explained in the previous subsection, “Parameter configuration in Random forest”.
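The three ingredients of RF described in Section 2.3 (a bootstrap sample per tree, m randomly chosen candidate variables per node, and a majority vote over trees) can be sketched in plain Python as below. This is only an illustration of the mechanics and does not grow actual trees or reproduce the randomForest package; all helper names are hypothetical, and resolving vote ties in favour of the label seen first is an arbitrary choice of this sketch.

```python
import math
import random
from collections import Counter


def default_mtry(p):
    """floor(sqrt(p)), the classification default of R's randomForest."""
    return math.isqrt(p)


def bootstrap_indices(n, rng):
    """Draw n row indices with replacement (one bootstrap sample per tree)."""
    return [rng.randrange(n) for _ in range(n)]


def candidate_features(p, mtry, rng):
    """Pick mtry distinct feature indices to be evaluated at one node."""
    return rng.sample(range(p), mtry)


def majority_vote(tree_labels):
    """Return the label given by the most trees (first seen wins a tie)."""
    return Counter(tree_labels).most_common(1)[0][0]


rng = random.Random(2016)                 # fixed seed, only for repeatability
mtry = default_mtry(260)                  # 260 features for k = (1, 4)
sample = bootstrap_indices(1000, rng)     # rows used to grow one tree
subset = candidate_features(260, mtry, rng)
print(mtry, len(sample), len(subset))
print(majority_vote(["Inga", "Bat", "Inga"]))
```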

2.6. Training and validation

In the ED-I and ED-II datasets, 80% of the sequences of each taxonomic entity were used as the training dataset and the remaining 20% (with known membership) were used for validation. For SD-I, 800 sequences of each simulation run were used to train the model and the remaining 200 sequences were used to assess the model accuracy. The number of sequences used for training and validation is provided in Supplementary Table S2. The training and validation setups were similar to those of earlier studies (Van Velzen et al., 2012; Weitschek et al., 2014a). Since there were more than two classes (species), a multi-class RF model was used. The encoded training datasets were used to train the model with the optimum parameter setting, and the corresponding validation sets were used to assess the performance of the classifier in terms of species identification success rate, which is defined as follows. Let $t_h$ be the number of query sequences belonging to the hth species, where h = 1, 2, …, H, and let $c_h$ be the number of query instances correctly assigned to the hth species. Then the species identification success rate of the model can be computed as $\sum_{h=1}^{H} c_h / \sum_{h=1}^{H} t_h$.
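A small Python sketch of the 80/20 per-species split and of the success rate defined above is given here for illustration. The function names are hypothetical, and rounding the training share down within each species is an assumption; the exact split counts used in the study are those of Supplementary Table S2.

```python
import random
from collections import defaultdict


def split_per_species(labels, rng, train_share=(4, 5)):
    """Return (train_idx, test_idx) with about 80% of each species in training."""
    by_species = defaultdict(list)
    for i, label in enumerate(labels):
        by_species[label].append(i)
    num, den = train_share
    train, test = [], []
    for members in by_species.values():
        rng.shuffle(members)
        cut = len(members) * num // den  # assumption: round down per species
        train.extend(members[:cut])
        test.extend(members[cut:])
    return sorted(train), sorted(test)


def success_rate(true_labels, predicted_labels):
    """Sum over species of c_h divided by the sum over species of t_h."""
    t = defaultdict(int)  # t_h: query sequences belonging to species h
    c = defaultdict(int)  # c_h: of those, correctly assigned to species h
    for truth, pred in zip(true_labels, predicted_labels):
        t[truth] += 1
        if truth == pred:
            c[truth] += 1
    return sum(c.values()) / sum(t.values())


# 50 species with 20 individuals each, as in one SD-I set
labels = [f"sp{h}" for h in range(50) for _ in range(20)]
train_idx, test_idx = split_per_species(labels, random.Random(0))
print(len(train_idx), len(test_idx))  # 800 200
```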


2.7. Comparison with existing approaches

The performance of the proposed approach was compared with four categories of approaches proposed in earlier studies, i.e., tree-based approaches: Neighbor Joining (NJ) and Parsimony (PAR) (Van Velzen et al., 2012); similarity-based approaches: BLAST and Nearest Neighbor (NN) (Van Velzen et al., 2012); diagnostic-based approaches: DNA-BAR (Das Gupta et al., 2005) and BLOG 2.0 (Weitschek et al., 2013); and supervised learning based approaches: SVM, C4.5, RIPPER and Naïve Bayes (Weitschek et al., 2014a). In tree-based methods, the barcode sequences of the query set are assigned to known species of the reference library on the basis of their cluster membership with the barcode sequences of the reference library. In similarity-based methods, an unknown specimen is assigned to the known species of the reference set with which it has the maximum number of nucleotide matches. In diagnostic-based methods, an unknown specimen is assigned to a particular known species when the presence or absence of certain nucleotides at certain positions is the same in both sequences, i.e., the sequence of the query set and the sequence of the reference set. The different approaches were compared in terms of species identification success rate.

2.8. Computation of running time

The running times (in seconds) of both rule-based (tree/similarity/diagnostic) and supervised learning approaches were computed. A system with the configuration CPU model Intel(R) Core(TM) 2 Duo, processor 2.00 GHz, RAM 2 GB and operating system (OS) Windows Seven Royale XP SP3 2010 was used for this purpose. For DNA-BAR, however, Ubuntu 11.04 OS was used with the same hardware configuration. Specific scripts and commands were used for computing the running time. A detailed summary of the approaches and the time computed for the purposes is given in Table 1.
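As a rough illustration of the similarity-based rule described in Section 2.7 (assign the query to the reference species with the most nucleotide matches), the following Python sketch may help. It assumes pre-aligned sequences of equal length and ignores gaps, so it is not the NN or BLAST procedure actually benchmarked; the function names are hypothetical.

```python
def count_matches(a, b):
    """Number of positions at which two aligned sequences agree."""
    return sum(x == y for x, y in zip(a, b))


def assign_by_similarity(query, reference):
    """reference: list of (species, sequence) pairs.

    Returns the species of the reference sequence with the most matches;
    on a tie, the earliest reference entry wins (an arbitrary choice).
    """
    best_species, _ = max(reference, key=lambda item: count_matches(query, item[1]))
    return best_species


# Toy reference library with made-up sequences, for illustration only
reference = [("species_1", "ACGTAC"), ("species_2", "ACGGGG")]
print(assign_by_similarity("ACGTAA", reference))  # species_1 (5 matches vs 3)
```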

2.9. Web server

Based on the proposed approach, we have developed a web server for assigning an unknown specimen to a previously known species present in the reference library. The server was developed using HTML, PHP and R code. The R code is executed in the background upon submission of reference barcode sequences and query barcode sequences in FASTA format, in which the names of the sequences follow the format of the Barcode of Life Data (BOLD) Systems (http://www.boldsystems.org/) (Ratnasingham and Hebert, 2007). Sequences can be submitted either by pasting them into text areas or by uploading FASTA files. The classification results are displayed separately for reference and query datasets.

Table 1 Summary of the approaches and the time computed for the purposes.

Approaches                                       Time taken for
SVM, RF, RIPPER, C4.5                            training the model and predicting the species labels for the test instances
NJ                                               executing the nj function (available in R package APE (Paradis et al., 2004))
NN                                               computing the distance using dist.dna function (available in R package APE)
PAR                                              constructing the most parsimonious tree (out of 10 trees) using MEGA6 (Tamura et al., 2013)
BLAST (Altschul et al., 1990) (Version 2.2.22)   running the blast program offline by using training sequences as database and test sequences as query set
DNA-BAR                                          extracting the distinguisher fragment using “degenbar” (Das Gupta et al., 2005)
BLOG 2.0                                         extracting the rules as well as assigning the species label to the test sequences

3. Results

3.1. Optimum values of parameters

Considering k-mer sizes up to 4 bp, both single-scale and multi-scale features were extracted from 50% of the sequences of ED-I. The RF model was then executed with each of the generated single-scale and multi-scale feature sets, with mtry set to √p (as recommended by Breiman (2001)) and ntree kept at 500. More precisely, mtry was set at $\sqrt{4^3}$ for k = 3 (single-scale), $\sqrt{4^2 + 4^3}$ for k = (2, 3) (multi-scale) and so on. The misclassification errors obtained after executing the model are given in Table 2. From Table 2 it is observed that the average misclassification error is minimum at multi-scale k = (1, 4) and (3, 4) as compared to the other values of k (both single-scale and multi-scale). Further, the misclassification error rate decreases as the scale of the single-scale features increases, i.e., the error rates are 0.181, 0.081, 0.047 and 0.039 for k = 1, 2, 3 and 4 respectively. A similar trend is not observed for multi-scale features. For instance, the misclassification error is lower for multi-scale k = (1, 4) (i.e., $4^1 + 4^4$ features) than for multi-scale k = (1, 2, 3, 4) (i.e., $4^1 + 4^2 + 4^3 + 4^4$ features). Moreover, although the average misclassification error is approximately the same for multi-scale k = (1, 4) and (3, 4), we have considered k = (1, 4) because of its smaller number of features, i.e., $4^1 + 4^4$ (< $4^3 + 4^4$). For k = (1, 4), it is further seen that the misclassification errors are almost stable after 100 classification trees for all the taxonomic entities (Fig. 1). Thus, the final RF model was executed using 260 ($4^1 + 4^4$) features with parameters mtry = √260 and ntree = 500. Although the errors are observed to stabilize after 100 trees (Fig. 1), 500 classification trees were grown in the final model in anticipation of further improvement in accuracy.

3.2. Performance analysis using ED-I

A multi-class RF model was trained independently in each category (Cypraeidae, Drosophila, Inga, Bats, Fishes and Birds) using the respective training set with the optimum parameter setting (i.e., mtry = √260 and ntree = 500), and the corresponding validation set was used to assess the species identification success rate. Based on the same dataset, the identification success rate of RF was further compared with that of other supervised learning techniques, viz., SVM, RIPPER, C4.5 and the Naïve Bayes classifier, which were used by Weitschek et al. (2014a) for species identification. The species identification success rates of RF and the other supervised learning techniques are shown in Fig. 2. It is observed that the species identification success rates of RF are 100%, 99.14%, 96.03% and 93.45% for Fish, Drosophila, Cypraeidae and Inga respectively, which are higher than those of the SVM, RIPPER, C4.5 and Naïve Bayes classifiers. However, in Bats, the identification success rates for SVM, RIPPER and Naïve Bayes are the same, i.e., 100%, which is 0.7% and 1.85% higher than those of RF and C4.5 respectively. In Birds, SVM performed much better than the other four methods. Further, it is seen that the identification success rates of RF and Naïve Bayes are >90% in all the categories, whereas they are <90% for SVM and C4.5 in Inga, and for RIPPER in Cypraeidae, Inga and Bird (Fig. 2). Further, it is observed that the overall accuracy (averaged over categories) of RF is 96.13%, which is on par with that of SVM (96.06%) and Naïve Bayes (95.48%) but higher than that of the RIPPER (90.81%) and C4.5 (92.34%) classifiers.

3.3. Performance analysis using ED-II

Using the parameters given under “Performance analysis using ED-I”, the RF model was also trained and validated using the ED-II dataset.


Table 2 Misclassification error rates in different categories under multi-class RF model fitted with different values of k, default mtry and ntree = 500.

k           Features (#)             Cypraeidae  Drosophila  Fish   Inga   Bat    Bird   Average
1           4^1                      0.283       0.064       0.087  0.078  0.167  0.405  0.181
2           4^2                      0.105       0.022       0.019  0.038  0.053  0.249  0.081
3           4^3                      0.074       0.014       0.014  0.033  0.036  0.113  0.047
4           4^4                      0.066       0.008       0.010  0.036  0.026  0.087  0.039
1, 2        4^1 + 4^2                0.105       0.030       0.019  0.038  0.056  0.253  0.084
1, 3        4^1 + 4^3                0.073       0.010       0.008  0.032  0.033  0.115  0.045
1, 4        4^1 + 4^4                0.065       0.008       0.008  0.034  0.027  0.087  0.038
2, 3        4^2 + 4^3                0.078       0.012       0.012  0.034  0.036  0.120  0.049
2, 4        4^2 + 4^4                0.069       0.008       0.010  0.034  0.026  0.086  0.039
3, 4        4^3 + 4^4                0.069       0.008       0.010  0.032  0.027  0.081  0.038
1, 2, 3     4^1 + 4^2 + 4^3          0.076       0.016       0.012  0.033  0.035  0.119  0.048
1, 2, 4     4^1 + 4^2 + 4^4          0.069       0.010       0.012  0.034  0.027  0.086  0.040
1, 3, 4     4^1 + 4^3 + 4^4          0.067       0.012       0.010  0.033  0.029  0.086  0.039
2, 3, 4     4^2 + 4^3 + 4^4          0.067       0.012       0.008  0.034  0.027  0.085  0.039
1, 2, 3, 4  4^1 + 4^2 + 4^3 + 4^4    0.068       0.010       0.010  0.033  0.026  0.093  0.040

# Number of features for corresponding values of k. Bold font denotes lowest overall misclassification rate.
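The feature counts in the "#" column and the "Average" column of Table 2 can be cross-checked with a few lines of Python. The sketch below is only an arithmetic check; the helper name `n_features` is hypothetical, and the error rates are copied from the k = (1, 4) and k = (3, 4) rows of Table 2.

```python
def n_features(scales):
    """Number of k-mer features for a tuple of scales: sum of 4**k."""
    return sum(4 ** k for k in scales)


# Misclassification error rates copied from Table 2
errors = {
    (1, 4): {"Cypraeidae": 0.065, "Drosophila": 0.008, "Fish": 0.008,
             "Inga": 0.034, "Bat": 0.027, "Bird": 0.087},
    (3, 4): {"Cypraeidae": 0.069, "Drosophila": 0.008, "Fish": 0.010,
             "Inga": 0.032, "Bat": 0.027, "Bird": 0.081},
}

for scales, by_category in errors.items():
    average = sum(by_category.values()) / len(by_category)
    print(scales, n_features(scales), round(average, 3))
# (1, 4) 260 0.038
# (3, 4) 320 0.038
```

Both scale combinations give the same rounded average error, while k = (1, 4) needs 60 fewer features, which matches the reason given in Section 3.1 for preferring it.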

In addition, the species identification success rate of RF was compared with that of well-established DNA barcode classification techniques, viz., phylogenetic trees (NJ, PAR), similarity-based approaches (NN, BLAST) and character-based approaches (DNA-BAR, BLOG). The species identification success rates of these methods are plotted as bar diagrams in Fig. 3. It can be seen that the identification success rate is highest for RF in all three taxonomic entities. The diagnostic method BLOG 2.0 (93.13%) performed next to RF (97.05%) in terms of overall identification success rate (averaged over the taxonomic entities), followed by the diagnostic method DNA-BAR (90.43%). It is further observed that the species identification success rates of BLOG 2.0 and RF are > 90% in all three cases (Drosophila, Inga and Cypraeidae), which is not the case with the other classifiers. It is also observed that the overall success rates of the diagnostic methods (DNA-BAR, BLOG 2.0) and RF are above 90%, whereas they are < 90% for both similarity- and tree-based methods (Fig. 3).

3.4. Performance analysis using SD-I

The performance of the proposed approach (RF classifier with k-mer frequency features) was also assessed using the simulated dataset and compared with other supervised learning approaches, i.e., the SVM, C4.5, RIPPER and Naïve Bayes classifiers. The species identification success rates of RF and the other supervised learning approaches are shown in Fig. 4. The identification success rates of RF are observed to be 96.62% and 94.94% for the effective population sizes 1000 and 50,000 respectively, which are higher than those of the other supervised classifiers (Fig. 4). However, the identification success rates of Naïve Bayes, SVM and RF are seen to be on par with each other and higher than those of C4.5 and RIPPER when the effective population size is 10,000. It is further seen that the overall species identification success rate of RF (96.11%) is a little higher than that of the SVM (95.74%) and Naïve Bayes (95.24%) classifiers, but ~ 2% higher than that of RIPPER (93.60%) and ~ 3.5% higher than that of C4.5 (92.86%) (Fig. 4). It is also noticed that the overall accuracies of SVM, Naïve Bayes, C4.5 and RF are consistent with those obtained on the empirical dataset ED-I. Based on SD-I, the species identification success rate of RF was also compared with that of similarity-based, tree-based and diagnostic-based approaches. It is seen that the species identification success rate of RF is ~ 95% for all three effective population sizes and is ~ 10% higher than that of the others (Fig. 5). Further, it is observed that the success rates of NN, BLAST, NJ, DNA-BAR and BLOG 2.0 are almost similar (~ 85%) for all three effective population sizes (Fig. 5). Moreover, the success rate is found to be lowest for PAR (< 80%).
Although the identification success rates of the similarity-based, tree-based and diagnostic-based approaches are seen to be higher in the empirical dataset (ED-II, Fig. 3) than in the simulated dataset (SD-I, Fig. 5), the trend in accuracy is found to be almost the same in both cases, i.e., in the order diagnostic-based > similarity-based > tree-based. Also, it is seen that the species identification success rates of the supervised learning approaches are higher in the simulated dataset (SD-I, Fig. 4) than in the empirical dataset (ED-I, Fig. 2), with some exceptions.

3.5. Performance analysis of BLOG 2.0 on ED-I

Fig. 1. Bar plots of out-of-bag (OOB) error rates with respect to the number of classification trees in Random forest. There are 500 classification trees in each category, and each bar represents the OOB error at an interval of 10 classification trees. It can be seen that the OOB error almost stabilized after 10 bars, i.e., 100 classification trees, in each category.

BLOG 2.0 is the latest tool for species identification using DNA barcodes and has achieved a higher identification success rate than DNA-BAR as well as tree-based and similarity-based approaches. Thus, its performance was further compared with that of the proposed approach using ED-I (as the performance of BLOG 2.0 has already been tested on ED-II and SD-I). The species identification success rates


Fig. 2. Species identification success rates of existing supervised learning approaches and proposed approach, while dataset ED-I is used. It can be seen that except Bat and Bird the species identification success rates of RF are higher than that of other approaches. In Bird, SVM performed better than the others, whereas in Bat SVM, RIPPER and Naïve Bayes achieved 100% species identification success rate.

Fig. 3. Species identification success rates of rule-based approaches along with the proposed approach, while dataset ED-II is used. It can be observed that species identification success rate is higher in RF than that of rule-based approaches.


of RF and BLOG 2.0 are shown in Fig. 6. From the figure, it is seen that the species identification success rates of RF are higher than that of BLOG 2.0 in all the six categories. It is also observed that overall accuracy of RF is 96.13% and is ~3% higher than the overall accuracy of BLOG 2.0 (93.32%) (Fig. 6).

3.6. Analysis of running time

The times taken by the different approaches are given in Table 3. Among the machine learning classifiers, the time taken by RF is higher than that of the others, and the time is lowest for C4.5. One possible reason is that RF is an ensemble of several tree classifiers, whereas C4.5 is a single tree-based classifier. On the other hand, RF took less time than BLOG 2.0, which is the most accurate of the rule-based approaches. Further, the times taken to execute BLAST and “degenbar” are also higher than that of RF. Among the tree-based approaches, the time taken to generate the NJ tree is less than that for the parsimonious tree (PAR).

3.7. Online species identification server

Fig. 4. Species identification success rates of existing supervised learning approaches and proposed approach, while dataset SD-I is used. The species identification success rates of RF, SVM and Naïve Bayes are almost equal and little higher than that of RIPPER and C4.5.

A web interface named SPIDBAR has been developed to help the taxonomist community identify specimens using DNA barcodes and is freely available for academic users at http://cabgrid.res.in:8080/spidbar. The training-result file provides, for each species, the number of observed individuals belonging to that species and the number of those individuals that were correctly identified. The test-result file provides the hypothetical (or observed) species label supplied by the user and the predicted species label for each query barcode. Links for downloading the result files are also provided. A snapshot of the prediction server is presented in Fig. 7a, and the result page after the execution of an example dataset is shown in Fig. 7b.

Fig. 5. Species identification success rates of rule-based approaches and proposed approach, while dataset SD-I is used. The species identification success rate of RF is much higher than that of other approaches, for all the three different effective population sizes.


Fig. 6. Species identification success rates of BLOG 2.0 and proposed approach based on ED-I dataset. It can be seen that the species identification success rates of proposed approach are more than that of BLOG 2.0 in all the six categories.

4. Discussion

Species identification is important for preserving species diversity as well as for monitoring the biological implications of a changing environment. DNA barcoding has been proposed as a global standard for assigning unknown specimens to their species (Rydberg, 2010). The concept of barcoding was popularized by Hebert et al. (2003), who proposed using the mitochondrial COI as the barcode for animals. Significant efforts have been made by the CBOL to promote the collection of DNA barcodes as well as the development of computational methods for correctly identifying a species by analyzing its barcode sequence (Bertolazzi et al., 2009). From a computational point of view, after the barcode sequence has been retrieved from an unknown specimen, a computational algorithm is used to compare it with a reference library containing the barcode sequences of known species, so that the specimen can be assigned a species label. This paper presents a computational approach for assigning a known species label to the barcode sequence of an unknown specimen by comparing its barcode sequence with the reference library. In this approach, the barcode sequences with pre-assigned species labels were first mapped onto feature vectors based on k-mer frequencies and were then used as input to the multi-class RF model to identify the unknown species. The species identification success rate of the proposed approach (RF classifier with k-mer features) was assessed and compared with that of other supervised learning based approaches, tree-based approaches, similarity-based approaches and diagnostic-based approaches, using two empirical datasets and one simulated dataset available in the public domain. In the existing supervised learning based methods (Weitschek et al., 2014a), the sequences are encoded based on sparse encoding, i.e., A → 1, T → 4, C → 2 and G → 3. The number of features generated by this encoding is not invariant to the length of the barcode sequence, whereas the number of features generated from k-mer frequencies is invariant to the sequence length. In addition, the number of features used in the existing supervised learning based approaches is much higher than the number of features (i.e., 260) used in the present study. The species identification success rate of RF was found to be on par with that of SVM and Naïve Bayes but higher than that of C4.5 and RIPPER when ED-I was used. However, in the simulated dataset (SD-I), only RF achieved > 96% accuracy, whereas SVM and Naïve Bayes remained below 96%. Further, C4.5 was found to perform better than RIPPER on ED-I, but RIPPER outperformed C4.5 on SD-I. This implies that RF is more consistent than the other supervised learning approaches considered. Based on ED-II, the overall species identification success rate of RF was found to be 97.05%, which was ~10% higher than that of the tree- and similarity-based approaches and ~5% higher than that of the diagnostic approaches. As in the empirical dataset (ED-II), RF outperformed the ad-hoc (similarity-, tree- and diagnostic-based) approaches in the simulated dataset (SD-I) as well. It was also seen that the diagnostic approaches outperformed both tree- and similarity-based approaches. Overall, it was found that the species identification success rate of RF was higher than that of the existing supervised learning and ad-hoc approaches. In general, the species identification success rate was higher for the supervised learning approaches (including the proposed approach) than for the rule-based approaches. However, an adequate reference library is required to train and validate a supervised learning model; otherwise the model may suffer from under-fitting or over-fitting (Weitschek et al., 2014a). In addition, cross-validation analysis requires an adequate number of barcode sequences per species (at least K sequences for K-fold validation).
Moreover, training and validation of a supervised learning model with thousands of species and millions of sequences may be computationally

Table 3 Time taken (in seconds) by different approaches for species identification based on DNA barcodes and specification of the system used for computation of time.

Categories: Bat, Inga, Fish, Drosophila, Cypraeidae, Bird

Tree/Similarity/Diagnostic approaches: PAR, NN, BLAST, BLOG 2.0, DNA-BAR, NJ
Machine learning based approaches: RF, SVM, Naïve Bayes, RIPPER, C4.5

7.78 9.5 6.19 6.39 32.63 10.86
46.43 4.54 16.62 10.78 93.21 65.24
1.16 1.18 6.28 2.73 1.56 3.27
22 132 56.03 37 144 98
560.24 339.74 302.94 201.84 1847.65 1076.16
38.81 17.61 18.59 28.94 46.68 27.15
14.63 14.49 14.4 6.9 57.97 44.83
6.08 6.13 3.42 3.07 15.94 17.65
10.18 7.06 8.5 4.28 33.84 20.42
13.71 10.18 8.26 3.6 67.75 44.94
3.82 2.42 2.99 1.8 5.14 5.21

System specification: CPU model Intel(R) Core(TM) 2 Duo Processor 2.00 GHz; RAM 2 GB; OS Windows Seven Royale XP SP3 2010, Ubuntu 11.04 (a)

(a) The “degenbar” package used for DNA-BAR is available in Linux environment. Thus, Ubuntu 11.04 Linux OS was used with the same system configuration.

P.K. Meher et al. / Gene 592 (2016) 316–324

Fig. 7. (a) Snapshot of home page of the SPIDBAR and (b) result page after execution of an example dataset.


expensive. Furthermore, supervised learning approaches do not provide information on diagnostic positions and nucleotide assignments, as found in rule-based approaches, i.e., RIPPER, BLOG 2.0 and DNA-BAR (Weitschek et al., 2014a). Specifically, the logic formulae obtained from BLOG 2.0 can be used in species characterization, molecular detection, etc. (Weitschek et al., 2014a). Though the proposed approach achieved a higher identification success rate, it did not provide a comprehensible, interpretable model like the diagnostic-based approaches. Therefore, if classification accuracy is the priority then SPIDBAR may be preferred over other methods, whereas if species-level information is required (i.e., the location of the alleles that characterize the species and the logic formula associated with the species) a diagnostic-based approach can be used instead. Furthermore, the time taken to execute the RF classifier is much less than that of BLOG 2.0, which is the best performer among the rule-based approaches. Thus, we believe that SPIDBAR and diagnostic-based approaches will complement each other in identifying the matching species by analyzing the DNA barcode of an unknown specimen.

5. Conclusion

This paper provides a computational approach for assigning an unknown specimen to a known species by analyzing its DNA barcode. The proposed approach was found to be comparable with well-known DNA barcode based species identification approaches when compared in terms of species identification success rate using simulated and real datasets. Thus, the proposed approach is believed to supplement the existing approaches for assigning a species label to an unknown specimen by analyzing its barcode sequence.
Based on this approach, a web server named SPIDBAR has also been developed and made publicly available at http://cabgrid.res.in:8080/spidbar/ to help taxonomists and other researchers working in the area of species identification using DNA barcodes. For reproducible research, all the empirical and simulated datasets used in this study, which were collected from http://dmb.iasi.cnr.it/blog.php, are made available at http://cabgrid.res.in:8080/spidbar/dataset.

Supplementary data associated with this article can be found in the online version, at doi:10.1016/j.gene.2016.07.010.

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

PKM conceived the study. ARR and PKM designed the study. PKM and TKS collected and processed the sequence dataset. PKM developed the prediction approach. TKS and PKM developed the web server. PKM, ARR and TKS drafted the manuscript. ARR refined and finalized the manuscript. All authors read and approved the final manuscript.

Acknowledgements

The grant (Agril.Edn.4-1/2013-A&P dated 11.11.2014) received from the Indian Council of Agricultural Research (ICAR) for the Centre for Agricultural Bioinformatics (CABin) scheme of the Indian Agricultural Statistics Research Institute (IASRI) is duly acknowledged.

References

Altschul, S.F., Gish, W., Miller, W., Myers, E.W., Lipman, D.J., 1990. Basic local alignment search tool. J. Mol. Biol. 215, 403–410.
Austerlitz, F., David, O., Schaeffer, B., Bleakley, K., Olteanu, M., Leblois, R., Veuille, M., Laredo, C., 2009. DNA barcode analysis: a comparison of phylogenetic and statistical classification methods. BMC Bioinforma. 10 (Suppl. 14), S10.
Bertolazzi, P., Felici, G., Weitschek, E., 2009. Learning to classify species with barcodes. BMC Bioinforma. 10 (Suppl. 14), S7.
Breiman, L., 2001. Random forests. Mach. Learn. 45, 5–32.
Breiman, L., Friedman, J.H., Olshen, R.A., Stone, C.J., 1984. Classification and Regression Trees. Chapman and Hall, New York.
Chase, M.W., Cowan, R.S., Hollingsworth, P.M., van den Berg, C., Madrián, S., Petersen, G., Seberg, O., Jorgsensen, T., Cameron, K.M., Carine, M., 2007. A proposal for a standardized protocol to barcode all land plants. Taxon 56, 295–299.
Chu, K.H., Xu, M., Li, C.P., 2009. Rapid DNA barcoding analysis of large datasets using the composition vector method. BMC Bioinforma. 10 (Suppl. 14), S8.
Das Gupta, B., Konwar, K.M., Măndoiu, I.I., Shvartsman, A.A., 2005. DNA-BAR: distinguisher selection for DNA barcoding. Bioinformatics 21 (16), 3424–3426.
Dinca, V., Zakharov, E.V., Hebert, P.D.N., Vila, R., 2011. Complete DNA barcode reference library for a country's butterfly fauna reveals high performance for temperate Europe. Proc. R. Soc. B 278, 347–355.
Hebert, P.D.N., Cywinska, A., Ball, S.L., DeWaard, J., 2003. Biological identifications through DNA barcodes. Proc. R. Soc. B 270, 313–321.
Hollingsworth, P.M., Forrest, L.L., Spouge, J.L., et al., 2009. A DNA barcode for land plants. Proc. Natl. Acad. Sci. U. S. A. 106, 12794–12797.
Kress, W.J., Erickson, D.L., 2007. A two-locus global DNA barcode for land plants: the coding rbcL gene complements the non-coding trnH-psbA spacer region. PLoS One 2, e508.
Kress, W.J., Wurdack, K.J., Zimmer, E.A., Weigt, L.A., Janzen, D.H., 2005. Use of DNA barcodes to identify flowering plants. Proc. Natl. Acad. Sci. U. S. A. 102, 8369–8374.
Li, J.L., Wang, L.F., Wang, H.Y., Bai, L.Y., Yuan, Z.M., 2012. High-accuracy splice site prediction based on sequence component and position features. Genet. Mol. Res. 11 (3), 3432–3451.
Liaw, A., Wiener, M., 2002. Classification and regression by randomForest. R News 2, 18–22.
Liu, C., Liang, D., Gao, T., et al., 2011. PTIGS-IdIt, a system for species identification by DNA sequences of the psbA-trnH intergenic spacer region. BMC Bioinforma. 12 (Suppl. 13), S4.
Paradis, E., Claude, J., Strimmer, K., 2004. APE: analyses of phylogenetics and evolution in R language. Bioinformatics 20, 289–290.
Polychronopoulos, D., Weitschek, E., Dimitrieva, S., Bucher, P., Felici, G., Almirantis, Y., 2014. Classification of selectively constrained DNA elements using feature vectors and rule-based classifiers. Genomics 104 (2), 79–86.
Ratnasingham, S., Hebert, P.D.N., 2007. BOLD: the barcode of life data system. Mol. Ecol. Notes 7, 355–364.
Rydberg, A., 2010. DNA Barcoding as a Tool for the Identification of Plant Material: A Case Study on the Medicinal Roots Traded in the Medina of Marrakech. Master Thesis, Department of Systematic Biology, Uppsala University.
Seifert, K.A., 2009. Progress towards DNA barcoding of fungi. Mol. Ecol. Resour. 9, 83–89.
Shaw, J., Lickey, E.B., Schilling, E.E., Small, R.L., 2007. Comparison of whole chloroplast genome sequences to choose non-coding regions for phylogenetic studies in angiosperms: the tortoise and the hare III. Am. J. Bot. 94, 275–288.
Tamura, K., Stecher, G., Peterson, D., Filipski, A., Kumar, S., 2013. MEGA6: molecular evolutionary genetics analysis version 6.0. Mol. Biol. Evol. 30 (12), 2725–2729.
Van Velzen, R., Weitschek, E., Felici, G., Bakker, F.T., 2012. DNA barcoding of recently diverged species: relative performance of matching methods. PLoS One 7 (1), e30490.
Weitschek, E., van Velzen, R., Felici, G., Bertolazzi, P., 2013. BLOG 2.0: a software system for character-based species classification with DNA barcode sequences: what it does, how to use it. Mol. Ecol. Resour. 13 (6), 1043–1046.
Weitschek, E., Fiscon, G., Felici, G., 2014a. Supervised DNA barcodes species classification: analysis, comparisons and results. BioData Min. 7, 4.
Weitschek, E., Cunial, F., Felici, G., 2014b. Classifying bacterial genomes on k-mer frequencies with compact logic formulas. Proceedings of 25th International Workshop on Database and Expert Systems Applications, pp. 69–73 (ISSN: 1529–4188).
Zhang, X., Lee, J., Chasin, L.A., 2006. The effect of nonsense codons on splicing: a genomic analysis. RNA 9, 637–639.