Supplementary Methods for Taxator-tk: Precise Taxonomic Assignment of Metagenomes by Fast Approximation of Evolutionary Neighborhoods

I. Taxonomic Assignment of Sequence Segments

Here we describe in detail the individual steps and the run-time properties of the
algorithm implemented in the program taxator, the second stage of the overall binning workflow using taxator-tk (Fig. 2b). We propose the realignment placement algorithm (RPA) for the taxonomic assignment of a query segment q, which can be any subsequence of the full query sequence (i.e. the query can be a read, contig, scaffold or a complete genome sequence). The algorithm consists of two pairwise alignment passes, in each of which q is aligned to segments of nucleotide reference sequences. It aims to identify as many taxa of the prediction clade (node R in Fig. 2a) as possible without explicitly resolving its phylogenetic structure.

1. Among the given set of homologous segments, constructed from overlapping alignments before application of the RPA, we define s to be the segment most similar to q, i.e. the one with the best local alignment score of all reference segments. In the first pass, all segments are aligned against s ($n$ alignments, where $n$ is the number of homologous segments). The resulting pairwise scores (our implementation uses the edit distance, i.e. mismatches + gaps) define an ordering among all segments or their corresponding taxa. The distinction between segments and associated taxa is neglected in the following for better readability. All taxa which are less distant to s than q, including s itself, are added to an initially empty set M, which holds all identified taxa of the prediction clade. The first taxon more distant than q is defined to be the outgroup segment o (Fig. 2c) and is used as the alignment target in the second and last pass, in which taxa similar to o are added to M.

2. We align all segments, including q, against o and rank the resulting scores. Then we add to M all taxa which have a lower score than q. After some fine-tuning, we chose to also add taxa with a higher score than q, within a small range accounting for erroneous scores, because o and q can be very distant homologs with noisy alignment. The width of this error band is determined on a per-segment basis as a linear function of the taxonomic disorder in the alignment scores and is not a universal or configurable run-time parameter. We interpret a rank disorder (e.g. a known family member of o being more similar to o than a
taxator-tk Supplementary Material Page 1 of 51
corresponding species member segment) as a discordance between gene tree and taxonomy and proportionally scale the effective score of q to enlarge M by taxa which are slightly more distant to o than q. This second pass requires $n$ new alignments, or fewer if some segments are identical to either q or s. If multiple best references (s) or outgroup segments (o) with identical alignment scores are present in these two passes, the calculations are repeated for every such segment in order to produce stable output. We reduced the additional computational effort in our implementation by detecting frequent identical segments and uninformative homologs. The final assignment taxon ID of q is the lowest common ancestor (LCA) of the taxa in M, or none if no outgroup was found.

The theoretical run-time of the segment assignment algorithm, measured in units of "number of pairwise alignments", is in $O(n)$ and about $2n$, where $n$ denotes the number of homologous segments. The run-time complexity for a single pairwise alignment is $O(l^2)$ and scales quadratically with the segment length $l$. Therefore the total run-time complexity per segment is $O(n \cdot l^2)$, and the total worst-case run-time for the entire query sequence can be bounded above by $O(n_{max} \cdot L^2)$, where $n_{max}$ denotes the maximum number of segment homologs among all query segments and $L$ is the total length
of the query sequence. Thus, in the worst case, the run-time for the entire sample scales linearly with the amount of sequence data (bp) and linearly with the number of homologs, but quadratically with the length of the individual segments. Segments with an excessive number of homologs, most often short segments of abundant and uninformative regions, have a negative impact on the program run-time. In our pipeline scripts we currently limit the number of homologs per query to the 50 top-scoring ones by default (a configurable run-time parameter in the program alignments-filter or directly in the local alignment search program) before passing them to taxator. Other tested values gave similar results, and the parameter, if changed, should be chosen based on hardware limitations. If this parameter is set too low, the number of reference segments can drop below a critical value such that no outgroup can be determined for some q, which therefore remain unassigned (but without impacting the taxon ID of other segments).

II. Consensus Binning Algorithm

Because taxator assigns only sparse segments in stage two of the workflow (Fig. 1b), a final processing step (Fig. 1c) is required to determine
a taxon ID for the entire query sequence. We have therefore implemented a simple, weighted consensus assignment scheme in the program binner, which optionally allows custom constraints to be applied, e.g. a minimum percentage identity (PID) for classification at the species level or the removal of taxa with low counts in the whole sample. However, there are currently only two mandatory run-time parameters to control the actual post-processing consensus algorithm. We first define the support of a query segment as the total number of positions identical to the best reference segment. The first run-time parameter specifies the minimum combined support at any rank (50 positions by default) and serves to ignore false predictions caused by short and often noisy segments. The other parameter specifies the minimum percentage of the summed support (70% by default) required for a majority taxon to outvote a contradicting minority. Inconsistent taxa below this support value are resolved by the LCA operation until the threshold is reached. Probably due to the conservative nature of the RPA, we found these two parameters to have minimal impact on the binning results in practice. The output of taxator additionally includes the taxa in the evolutionary neighborhood, a score reflecting the agreement between the segment tree and the taxonomy, and a score for interpolating the query-branch location between the R and X nodes of Fig. 2. We provide Python language bindings for processing with other applications.

III. Taxonomy and Phylogeny

Taxator-tk assumes that the NCBI taxonomy used for the assignment correctly captures the evolutionary process of speciation, although we know that the categorization of some taxa might be inconsistent with their evolution. If the phylogenetic information inferred from similarity scores disagrees with the taxonomic structure, assignments are made to a consistent higher rank. For instance, horizontal gene transfer and upstream sequence misassembly can cause multiple similar copies of a sequence to be distributed across unrelated taxa. If the algorithm cannot trace a query sequence to have evolved with either copy, it is usually assigned to the LCA of these clades. However, if the donor clade is unknown, the query may also be assigned to the recipient clade and the horizontal transfer or misassembly can go undetected. Thus assignment errors caused by the evolution of genes, upstream technical errors or the taxonomy cannot always be eliminated in this framework. It remains to be assessed whether the use of an alternative microbial taxonomy such as the GreenGenes [1] or the SILVA [2] taxonomy would improve on the
taxonomic assignment.

IV. Comparison and Innovations

Taxator-tk shares some ideas with previous programs, starting with MEGAN [3], which uses local alignment scores to define a "neighborhood of related sequences" and then makes a taxonomic estimate which is the LCA of the corresponding taxa. This neighborhood threshold is a percentage of the local alignment score and can be interpreted to reflect the rate of evolution within a taxonomic group. Its value is empirical and lacks stronger justification. The neighborhood definition has been improved in taxator-tk and other programs. To our knowledge, SOrt-ITEMS [4] was the first algorithm to use the logic of realignment to the best reference (termed reciprocal similarity) for read assignment, but it is restricted to protein-level alignment and is implemented as a wrapper around (the legacy C version of) BLAST+ [5]. Protein-level alignment in general triples the run-time of the local alignment step (translation into three frame shifts) and cannot make use of faster nucleotide aligners. SOrt-ITEMS also uses fixed similarity thresholds in terms of percentage identity to define universal levels of conservation within taxonomic groups, assuming the same rate of evolution for different genetic regions and clades. Furthermore, SOrt-ITEMS was primarily designed for reads, and even if it performs well for longer sequences, its run-time is expected to increase proportionally with the input sequence length. Both follow-up programs, taxator-tk and CARMA3 [6], adopted the logic of reciprocal alignment, extended it and removed the assumption of universal conservation levels. CARMA3 accounts for a heterogeneous rate of evolution for different genetic regions. The initial identification of similar sequences in the reference can be based on nucleotide or protein BLAST search or on profile Hidden Markov Models with HMMER [7]. In BLAST mode, CARMA3, like SOrt-ITEMS, uses a single reciprocal alignment search and then extrapolates or interpolates alignment scores to select a taxonomic rank for prediction. It therefore assumes a parameterized model for the conservation level at a taxonomic rank: a linear function which is fitted to the observed local alignment scores. With taxator-tk, we instead use a non-parametric score ranking algorithm. Also, to our knowledge, we provide the first algorithm to determine a proper outgroup and to sparsify the input data, which makes it possible to assign distinct regions of the query sequence to possibly different taxonomic groups. Moreover, we at most assume segment-wise constant
rates of evolution (equally long branches from a common ancestor). This makes the major algorithmic component parameter-free and robust in itself, independent of the individual segment sizes. Through the sparsification procedure, it accounts for structural rearrangements among distant relatives and scales better with the length of the input sequences. The individual segment assignments allow for a robust consensus voting scheme for the assignment of entire sequence fragments. The segment-specific classifications could also be used to detect an inconsistent taxonomic composition of an input sequence, which can be caused by horizontal gene transfer events (HGTs) and assembly errors. Unlike most previous approaches, taxator-tk was developed for and tested with fast nucleotide sequence local alignments instead of protein sequence alignments, although both can be used for the local alignments in stage 1 of the workflow. Our comparisons, however, suggest that the additional computations required for protein-level homology search do not considerably improve the results with taxator-tk. Thus, taxonomic binning of a metagenome sample with taxator-tk requires no more than the specification of reference sequences, their taxonomic affiliations and an aligner such as BLAST or LAST [8]. On the implementation side, all workflow steps for taxonomic assignment with taxator-tk are designed in a modular way, making it easy to save, compress, reuse or recompute results. The computation-intensive classification of segments in taxator runs in parallel on many CPU cores and uses the open source C++ algorithm library SeqAn [9] for fast pairwise alignment.

V. Performance Measures

As metagenome datasets can vary in taxonomic composition, in terms of which taxa are present and their relative abundances, this needs to be taken into consideration when evaluating taxonomic assignment methods. If an algorithm performs better for some clades than for others at a given rank, we call it taxonomically biased. A classifier is often biased if it uses parameters that fit one clade better than another. This can be the case if the parameters were chosen to give good overall assignment accuracy (a low total number of false predictions) on training data with biased taxonomic composition. Such a method is optimized to perform well for the abundant taxa of these particular training data and will not generalize well when applied to a sample of different taxonomic structure and abundances. To account for uneven taxonomic composition in evaluation datasets and to obtain comparable performance estimates across datasets of different taxonomic composition, we used
as the primary evaluation measure the bin-averaged precision (or positive predictive value), also known as macro-precision:

$$\text{macro-precision} = \frac{1}{N_p} \sum_{i=1}^{N_p} \text{precision}_i$$ (Equation V.1)

where $N_p$ is the number of all predicted bins and

$$\text{precision}_i = \frac{TP_i}{TP_i + FP_i}$$ (Equation V.2)

True positives ($TP_i$) are the correct assignments to the $i$-th bin and false positives ($FP_i$) the incorrect assignments to the same bin. The macro-precision is the fraction of correct sequence assignments over all assignments to a given taxonomic bin, averaged over all predicted bins for a given rank. For falsely predicted bins which do not occur in the data, the precision is therefore zero. This value reflects how trustworthy the bin assignments are on average from a user's perspective, as it is averaged over all predicted bins. In addition to the macro-precision, we report the raw numbers of true and false predictions for every cross-validation, as well as a quick overall precision for pooled ranks. This overall precision is most informative for species+genus+family and reports the fraction of true classifications among the predictions for all these ranks in a single pooled bin:

$$\text{overall precision} = \frac{\sum_{r} TP_r}{\sum_{r} \left( TP_r + FP_r \right)}$$ (Equation V.3)

where the sums run over the pooled ranks $r$, and $TP_r$ and $FP_r$ are the true and false predictions at rank $r$.
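As an illustration of Equations V.1 to V.3, the following minimal Python sketch computes the macro-precision over predicted bins and the pooled overall precision. The function names and the input layout (a pair of true and false positive counts per bin or per rank) are assumptions made for this example and are not part of taxator-tk.

```python
from typing import Dict, Iterable, Tuple

# (true positives, false positives); hypothetical input layout for this sketch
Counts = Tuple[float, float]


def macro_precision(bins: Dict[str, Counts]) -> float:
    """Mean of TP / (TP + FP) over all predicted bins (Equations V.1 and V.2).

    Every predicted bin is assumed to hold at least one assignment.
    """
    precisions = [tp / (tp + fp) for tp, fp in bins.values()]
    return sum(precisions) / len(precisions)


def overall_precision(pooled: Iterable[Counts]) -> float:
    """TP / (TP + FP) after pooling several ranks into one bin (Equation V.3)."""
    pooled = list(pooled)
    tp = sum(t for t, _ in pooled)
    fp = sum(f for _, f in pooled)
    return tp / (tp + fp)


# Three made-up genus bins; the third one does not occur in the data,
# so its precision is zero and it pulls the average down.
genus_bins = {"genus_A": (90, 10), "genus_B": (5, 5), "genus_C": (0, 4)}
print(round(macro_precision(genus_bins), 3))  # 0.467

# (true, false) counts for family, genus and species in the new species
# scenario of Supplementary Figure S3.
print(round(100 * overall_precision([(1350, 18), (2678, 64), (0, 427)]), 1))  # 88.8
```

The second call reproduces the family+genus+species overall precision of 88.8% listed for the new species scenario, which pools 4028 true and 509 false assignments.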
We measure the taxonomic bias of a method in terms of the standard deviation over all individual bin precisions. (Equation V.4) where (Equation V.5) The standard deviation is small if all predicted bins have a similar precision. A universally good method should have a high macro-precision with a low taxonomic
bias. The recall (or sensitivity) is a measure of the completeness of a predicted bin and, analogously, the macro-recall is the fraction of correctly assigned sequences among all sequences belonging to a certain bin, averaged over all existing bins in the test data [10]:

$$\text{macro-recall} = \frac{1}{N_r} \sum_{i=1}^{N_r} \text{recall}_i$$ (Equation V.6)

where $N_r$ is the number of all existing bins in the test data, $TP_i$ is the number of true positives of the $i$-th bin and

$$\text{recall}_i = \frac{TP_i}{TP_i + FN_i}$$ (Equation V.7)

False negatives ($FN_i$) are the assignments belonging to the $i$-th bin but which were
classified to another bin or left unassigned. The macro-recall reflects how well the classifier works from a developer's perspective rather than from the user's perspective, as it is usually not known which predicted bins correspond to existing ones and which do not.

VI. Low-abundance Filtering

The number of predicted bins at each rank can be quite large, at most the number of known taxa in the taxonomy and reference sequence data. When noise is considered to be evenly distributed across this large output space, bins with few assigned sequences are more likely to be falsely identified than larger bins (the chance of independently assigning the same bin by chance $n$ times is $(1/N)^n$, where $N$ is the number of possible bins). Since the macro-precision is an average over all predicted bins, it is heavily affected by bins with few sequences assigned. As a result, classifiers that predict clades present at low frequencies in the sample score badly under this measure. To correct for this effect, we define a truncated average precision that ignores the least abundant predicted bins and considers only the largest predicted bins constituting a minimum fraction $\alpha$ of the total assignments (bins of equal size are also included). This modification acts as a noise filter and accounts for different behavior of classifiers without explicitly considering the size of the model space or the number of existing species in the actual sample. We set $\alpha$ to 0.99 for our evaluations.
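The truncation can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the evaluation code behind the reported tables: the function name, the input layout (bin name mapped to true and false positive counts) and the handling of ties by exact bin size are choices made for this example.

```python
from typing import Dict, Tuple


def truncated_macro_precision(bins: Dict[str, Tuple[int, int]],
                              alpha: float = 0.99) -> float:
    """Average precision over the largest bins covering a fraction alpha.

    bins maps a predicted bin to (true positives, false positives).
    Bins are visited from largest to smallest until they hold at least
    alpha of all assignments; bins as large as the last kept one are kept too.
    """
    sized = sorted(((tp + fp, tp / (tp + fp))
                    for tp, fp in bins.values() if tp + fp > 0),
                   reverse=True)
    total = sum(size for size, _ in sized)
    kept, covered, last_size = [], 0, None
    for size, precision in sized:
        if covered >= alpha * total and size != last_size:
            break  # coverage reached and no tie with the last kept bin
        kept.append(precision)
        covered += size
        last_size = size
    return sum(kept) / len(kept)


# Made-up example: two large bins and two tiny, mostly wrong ones.
bins = {"A": (900, 50), "B": (40, 5), "C": (0, 3), "D": (1, 1)}
print(round(truncated_macro_precision(bins, alpha=0.99), 3))  # 0.918
print(round(truncated_macro_precision(bins, alpha=1.0), 3))   # 0.584
```

In this example the two smallest bins hold only 5 of 1000 assignments, yet they lower the untruncated average from about 0.92 to about 0.58, which is the effect the filter is meant to remove.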
VII. Cross-validation

Despite the limitations of simulated metagenomes, which incorporate assumptions about sequencing error rates or species abundance distributions, it is very informative to evaluate taxonomic assignment methods on simulated sequence data, as real metagenome samples lack taxon IDs for evaluation. Our canonical way of evaluating a method on simulated data is a version of leave-one-out cross-validation: each query sequence is classified after removing all identical or related sequences up to a given rank from the reference collection. For example, to assess the performance in assigning query sequences from a new species, all sequences belonging to this species are removed from the reference sequence collection for the classifier. Performance measures (macro-recall, macro-precision), along with other statistics (true/false/unassigned data, overall precision, bin counts) which are available in the accompanying tables, were normally calculated in units of the number of assigned base pairs, or of the number of assigned sequences if these had comparable lengths. These values were calculated for all ranks (species, genus, family, order, class, phylum, domain/superkingdom) for seven simulations: either all reference data was used (per query) or all data from the query species, genus, family, order, class or phylum was removed from the reference data prior to classification. The assignments of these seven cross-validation experiments were averaged for a combined performance summary with standard measures.

VIII. Consistency Analysis

In order to evaluate the predictions for real metagenome samples, where no underlying correct taxon IDs are known for the sequences, we assigned sequences linked by assembly and calculated an assignment consistency value. We split long contigs into multiple pieces and classified each piece independently. Assuming that the sequence assembly was correct in the first place, contradicting assignments of pieces that originate from the same contig represent false assignments.
This unveils part of the errors made by a particular method, but some, if not the majority, will go undetected because the true ID remains unknown and the assignments for a contig can be consistently wrong. Hence these results are generally more difficult to interpret than those from simulated data.

IX. Sequence Homology Search via Local Alignment

In the course of evaluation we created many local alignments as input to the
taxonomic assignment programs CARMA3, MEGAN4/5 and taxator-tk. The nucleotide alignments were mostly generated using the alignment program LAST (version 320) because it ran faster than BLAST+/blastn (version 2.2.28+) without noticeable differences in the output alignments. The protein-level alignments which we used in our evaluations were generated with BLAST+/tblastx (version 2.2.28+) because we wanted to compare with identical nucleotide reference sequences. We support and tested different alignment programs because BLAST is standard and easy to parallelize, whereas LAST has a faster algorithm but high memory requirements. LAST ran with comparable speed to the BLAST+/megablast algorithm, which has limited sensitivity and in practice resulted in two to four times fewer query sequences being aligned and classified. For a detailed comparison of alignment programs and how LAST compares to other programs such as RAPSEARCH2 [11] and BLAT [12], consider Niu et al. [13] and Darling et al. [14]. In our evaluations, LAST was roughly 50 to 200 times faster than BLAST+/blastn and about as fast as BLAST+/megablast. LAST is also tunable for better sensitivity with protein-coding nucleotide sequences using a special form of seeding. If other alignment programs are found to be better suited for a particular data type, these can easily be incorporated into the provided workflows. For instance, local protein sequence alignments can be performed in the homology search step, e.g. by using BLAST+/tblastx. There are fast aligners such as RAPSEARCH2, PAUDA [15] and DIAMOND [16] that allow searching for homologs in large reference collections of amino acid sequences. To produce compatible input for taxator-tk, the amino acid alignment positions must be converted into nucleotide positions. For our short sequence length evaluation (Supplementary Fig. S6-S8), the evaluation of a published SimMC scenario (Supplementary Fig. S21) and the evaluation of a simulated metagenome sample with 49 species (Fig. 3, Supplementary Fig. S11-S13), we used a standard BLAST+/blastn (version 2.2.28+) and BLAST+/tblastx search. We chose the default alignment parameters and scoring schemes with each aligner. The generated alignments were then provided in BLAST tabular format to be usable with CARMA3 and MEGAN4/MEGAN5. Taxator-tk reads a simple tab-separated alignment format that can be generated directly with BLAST+ or with conversion scripts which we provide for the MAF alignment format of LAST. This arrangement ensures that taxator-tk can be easily adapted to profit from
advancements in the field of local alignment in the future. Users can also employ amino acid level alignment if the final output is mapped back to positions on the nucleotide reference and query sequences. The easiest way to achieve this is to use BLAST+/tblastx, although this is computationally more demanding than directly searching a collection of protein sequences for which nucleotide sequences are also available.

X. Program Parameters and Versions

For taxonomic assignment with MEGAN4 (version 4.70.4) we used the parameters minscore=20, toppercent=20, minsupport=5 and mincomplexity=0.44. With MEGAN5 (version 5.4.3), we used the default options minsupport=10, minscore=50, max_expected=0.01, minimal_coverage_heuristic=on and top_percent=20, as with MEGAN4. In CARMA3, we used the standard parameters in the contained configuration file. Kraken (version 0.10.4b) was also applied with the standard commands and without shrinking the database (shrink_db.sh). Taxator-tk (version 1.1.1-extended) was run with standard settings, restricted to the 50 best-scoring local alignments to avoid long run-times for some of the query sequences. This is purely a convenience filter at the current state of development and is meant to be replaced by an adaptive per-segment heuristic.

XI. 16S Cross-validation

As a proof of concept, we evaluated the performance of taxator-tk in classifying the most widely used taxonomic marker gene in studies of microbial diversity, the 16S rRNA gene. For our evaluation, we extracted 7,175 annotated 16S rRNA genes (Suppl. Fig. 5), each with a minimum length of 1 kb, from mRefSeq47 (Suppl. Fig. 9). The sequences were assigned with taxator-tk using the entire mRefSeq as reference, not just 16S genes. The cross-validation assesses the performance of 16S gene assignment in a wide range of situations. The performance statistics were calculated based on the number of assigned sequences, as all have comparable length. When using the complete reference sequences, 87% of sequences were assigned to the ranks of species, genus and family with 100% accuracy (Supplementary Fig. S3b); the remaining 13% were correctly assigned at higher ranks. This is an ideal situation showing the baseline on our dataset (in terms of the assigned rank depth). In more realistic simulations, when we tested assignment of genes from novel species or novel higher-level clades, assignments were accordingly made to higher ranks in
most cases. For instance, when simulating novel species, 2,678 contigs were assigned to the correct genera, while 491 erroneous species and genus assignments were made. The macro-precision in the combined cross-validation (Fig. 2) was always above 92%, with standard deviations from 10 to 25%, which demonstrates a good and even performance of taxator-tk for all clades in the case of 16S rRNA data.

XII. FAMeS Cross-validation

On the FAMeS contig datasets, taxator-tk produced fewer errors for all taxonomic ranks than MEGAN4, which was accompanied by a moderate reduction in macro-recall throughout all individual experiments and in the combined cross-validation experiments. For SimMC, the macro-precision was three to four times as large as MEGAN4's for species to order, with higher macro-recall (Supplementary Fig. S17-S18). The species to family overall precision was ~91% for taxator-tk (~59% for MEGAN4) and taxator-tk estimated 54 species bins (MEGAN4: 188) for the 47 actual species in SimMC. Similarly, for SimHC, taxator-tk achieved a higher macro-precision for all ranks, which was most pronounced for class and phylum (Supplementary Fig. S19-S20). By contrast, the macro-recall was slightly reduced and both methods underestimated the 96 existing species in SimHC.

XIII. Supplementary Files

The PDF attachment includes informative interactive charts and files which are necessary to reproduce the results shown in the article. Larger benchmark data can be downloaded from http://algbio.cs.uni-duesseldorf.de/software/.

Supplementary Methods References

1. DeSantis, T. Z. et al. Greengenes, a chimera-checked 16S rRNA gene database and workbench compatible with ARB. Appl. Environ. Microbiol. 72, 5069–72 (2006).
2. Quast, C. et al. The SILVA ribosomal RNA gene database project: improved data processing and web-based tools. Nucleic Acids Res. 41, D590–6 (2013).
3. Huson, D. H., Auch, A. F., Qi, J. & Schuster, S. C. MEGAN analysis of metagenomic data. Genome Res. 17, 377–86 (2007).
4. Monzoorul Haque, M., Ghosh, T. S., Komanduri, D. & Mande, S. S. SOrt-ITEMS: Sequence orthology based approach for improved taxonomic estimation of metagenomic sequences. Bioinformatics 25, 1722–30 (2009).
5. Camacho, C. et al. BLAST+: architecture and applications. BMC Bioinformatics 10, 421 (2009).
6. Gerlach, W. & Stoye, J. Taxonomic classification of metagenomic shotgun sequences with CARMA3. Nucleic Acids Res. 1–11 (2011).
7. Finn, R. D., Clements, J. & Eddy, S. R. HMMER web server: interactive sequence similarity searching. Nucleic Acids Res. 39 Suppl 2, W29–37 (2011).
8. Frith, M. C., Hamada, M. & Horton, P. Parameters for accurate genome alignment. BMC Bioinformatics 11, 80 (2010).
9. SeqAn. at http://www.seqan.de
10. McHardy, A. C., Martín, H. G., Tsirigos, A., Hugenholtz, P. & Rigoutsos, I. Accurate phylogenetic classification of variable-length DNA fragments. Nat. Methods 4, 63–72 (2007).
11. Zhao, Y., Tang, H. & Ye, Y. RAPSearch2: a fast and memory-efficient protein similarity search tool for next generation sequencing data. Bioinformatics 28, 125–6 (2011).
12. Kent, W. J. BLAT--the BLAST-like alignment tool. Genome Res. 12, 656–64 (2002).
13. Niu, B., Zhu, Z., Fu, L., Wu, S. & Li, W. FR-HIT, a very fast program to recruit metagenomic reads to homologous reference genomes. Bioinformatics 27, 1704–5 (2011).
14. Darling, A. E. et al. PhyloSift: phylogenetic analysis of genomes and metagenomes. PeerJ 2, e243 (2014).
15. Huson, D. H. & Xie, C. A poor man's BLASTX--high-throughput metagenomic protein database search using PAUDA. Bioinformatics 30, 38–9 (2014).
16. Buchfink, B., Xie, C. & Huson, D. H. Fast and Sensitive Protein Alignment using DIAMOND, under review.
Supplementary Figure S1: Query sequence segmentation and segment splicing
Query and corresponding reference segments from local alignment region extension and splicing. Blue bars correspond to original local alignment regions on reference nucleotide sequences which are positionally aligned to the query nucleotide sequence in red. These alignments are generated by a local (nucleotide) sequence aligner such as BLAST or LAST before running taxator. If alignments overlap on the query, they are joined into query segments which are flanked by regions without detected similarity to any known reference sequence. Reference segments are constructed from the original alignment reference regions (blue) by extension (gray bars) with the same number of nucleotides which are missing to match the length of the query segment. The corresponding sets of homologs are the input to the core taxonomic assignment algorithm in taxator.
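The segment construction described in this caption can be sketched in a few lines of Python. This is a simplified illustration and not the taxator implementation: the `Alignment` record, the function name `build_segments`, half-open forward-strand coordinates and clipping only at the reference start are all assumptions made for the example.

```python
from typing import List, NamedTuple, Tuple


class Alignment(NamedTuple):
    """One local alignment; half-open, forward-strand coordinates (assumed)."""
    ref_id: str
    q_start: int
    q_end: int
    r_start: int
    r_end: int


RefSegment = Tuple[str, int, int]  # (reference id, start, end)


def build_segments(alignments: List[Alignment]
                   ) -> List[Tuple[Tuple[int, int], List[RefSegment]]]:
    """Join alignments that overlap on the query and extend their references."""
    groups: list = []  # each entry: [query start, query end, member alignments]
    for aln in sorted(alignments, key=lambda a: a.q_start):
        if groups and aln.q_start < groups[-1][1]:  # overlaps the open segment
            groups[-1][1] = max(groups[-1][1], aln.q_end)
            groups[-1][2].append(aln)
        else:
            groups.append([aln.q_start, aln.q_end, [aln]])

    segments = []
    for q_start, q_end, members in groups:
        refs = []
        for a in members:
            # Extend by the number of query positions missing on each side,
            # clipping at the start of the reference sequence.
            left = max(0, a.r_start - (a.q_start - q_start))
            right = a.r_end + (q_end - a.q_end)
            refs.append((a.ref_id, left, right))
        segments.append(((q_start, q_end), refs))
    return segments


alignments = [
    Alignment("refA", 100, 400, 1100, 1400),
    Alignment("refB", 300, 600, 50, 350),
    Alignment("refC", 900, 1000, 10, 110),
]
for query_range, refs in build_segments(alignments):
    print(query_range, refs)
```

In this made-up input, the first two alignments overlap on the query and form one segment covering positions 100 to 600, while the third alignment forms its own segment. The sketch ignores reverse-strand hits, gaps within alignments and the end of the reference sequence, all of which a real implementation has to handle.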
Supplementary Figure S2: Taxonomic assignment of segments
Color legend (assigned rank): unassigned, class, family, genus.
Four long contigs of the SimMC dataset. Colored boxes show segments that were assigned by taxator when all species reference data was removed (new species simulation). White regions in between lack alignments from the local alignment search and therefore have no homologs for assignment. All assigned regions in this example are consistently assigned at the taxonomic ranks genus, family and class. The shown segments are used by the program binner to derive consistent whole-sequence taxonomic assignments, as done in our evaluations.
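The per-segment assignments shown here are produced by the realignment placement algorithm (RPA) of Section I. Its two passes can be sketched in Python as follows. This is a simplified sketch under stated assumptions, not the taxator implementation: s is chosen by the smallest edit distance to q instead of the best local alignment score, each taxon contributes exactly one segment, ties are not re-evaluated, the error band of the second pass is omitted, and a plain dynamic-programming edit distance stands in for the SeqAn alignment. All function and variable names are hypothetical.

```python
from typing import Dict, Optional, Sequence


def edit_distance(a: str, b: str) -> int:
    """Unit-cost edit distance (mismatches + gaps) by dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # gap in one sequence
                            curr[j - 1] + 1,            # gap in the other
                            prev[j - 1] + (ca != cb)))  # match or mismatch
        prev = curr
    return prev[-1]


def lowest_common_ancestor(lineages: Sequence[Sequence[str]]) -> Optional[str]:
    """Deepest taxon shared by all root-to-taxon lineages."""
    lca = None
    for level in zip(*lineages):
        if len(set(level)) != 1:
            break
        lca = level[0]
    return lca


def rpa_assign(query: str, references: Dict[str, str],
               lineage: Dict[str, Sequence[str]]) -> Optional[str]:
    """Two-pass assignment of one query segment; None means unassigned."""
    # s: the reference segment closest to q (stand-in for the best local score).
    s = min(references, key=lambda t: edit_distance(query, references[t]))

    # Pass 1: align all segments against s and rank them relative to q.
    to_s = {t: edit_distance(seq, references[s]) for t, seq in references.items()}
    q_to_s = edit_distance(query, references[s])
    members = {t for t, d in to_s.items() if d < q_to_s} | {s}
    farther = [t for t, d in to_s.items() if d > q_to_s]
    if not farther:
        return None  # no outgroup found
    o = min(farther, key=to_s.get)  # first taxon more distant than q

    # Pass 2: align all segments against o; keep taxa closer to o than q is.
    q_to_o = edit_distance(query, references[o])
    members |= {t for t, seq in references.items()
                if edit_distance(seq, references[o]) < q_to_o}
    return lowest_common_ancestor([lineage[t] for t in members])


# Toy data: two species per genus, genera A and B in the same family.
references = {
    "A1": "AAAAAAAAAAAA", "A2": "CAAAAAAAAAAA",
    "B1": "TTTTTAAAAAAA", "B2": "TTTTTTAAAAAA",
    "C1": "CCCCCCCCCCCC",
}
lineage = {
    "A1": ["Bacteria", "FamX", "GenA", "A1"],
    "A2": ["Bacteria", "FamX", "GenA", "A2"],
    "B1": ["Bacteria", "FamX", "GenB", "B1"],
    "B2": ["Bacteria", "FamX", "GenB", "B2"],
    "C1": ["Bacteria", "FamY", "GenC", "C1"],
}
print(rpa_assign("AAAAAAAAAGGG", references, lineage))  # FamX
```

In the toy example, A1 is the best reference s and B1 becomes the outgroup o. The second pass adds B1 and B2 to M, so the query is assigned to the family shared by both genera rather than to the genus of its best hit, which mirrors the conservative behavior described for the RPA.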
Supplementary Figure S3 - 16S gene assignment with taxator-tk

[Each panel also contains a chart titled "taxator-tk on RefSeq 16S genes". x-axis: taxonomic rank (unassigned, superkingdom, phylum, class, order, family, genus, species); y-axes: number of assigned sequences and % macro-precision and macro-recall; series: false (sequences), true (sequences), macro precision α=0.99, macro recall.]

In the tables below, true, false and unknown are numbers of sequences; macro prec. is the macro precision at α=0.99; macro precision, macro recall and their standard deviations (stdev) are given in %.

(a) summary scenario

rank                           depth  true     false   unknown  macro prec.  stdev  pred. bins  macro recall  stdev  real bins
unassigned                     0      274.4    0.0     0        100.0        0.0    1           100.0         0.0    1
superkingdom                   1      2159.6   0.0     0        100.0        0.0    2           82.7          14.2   2
phylum                         2      869.1    66.6    0        95.5         13.8   13          33.5          23.6   32
class                          3      1417.9   92.6    0        96.1         10.7   25          27.3          18.4   52
order                          4      420.1    69.4    0        95.4         12.6   62          20.7          14.2   109
family                         5      471.6    26.1    0        95.7         13.5   148         16.0          11.9   235
genus                          6      636.7    65.0    0        95.8         16.1   342         9.2           8.9    615
species                        7      623.8    83.0    0        92.6         24.5   570         5.1           6.7    1416
avg/sum (all but unassigned)   2.6    6598.8   402.7   0        95.9         13.0   166.0       27.8          14.0   351.6
avg/sum (all with unassigned)  2.6    6873.3   402.7   0        96.4         11.4   145.4       36.8          12.2   307.8

description              sum true  sum false  overall prec.
root+superkingdom        4593.6    0.0        100.0
phylum+class+order       2707.1    228.6      92.2
family+genus+species     1732.1    174.1      90.9
all but unassigned                            94.2
all with unassigned                           94.5

(b) complete reference sequences (the panel title, the macro precision column and the unknown column are missing from the extracted text)

rank                           depth  true     false   stdev (prec.)  pred. bins  macro recall  stdev  real bins
unassigned                     0      10       0       0.0            1           100.0         0.0    1
superkingdom                   1      80       0       0.0            2           97.6          2.4    2
phylum                         2      113      0       0.0            16          78.3          38.3   32
class                          3      428      0       0.0            29          77.6          37.2   52
order                          4      272      0       0.0            67          72.4          39.6   109
family                         5      750      0       0.0            158         71.0          40.4   235
genus                          6      1779     0       0.0            337         53.9          48.0   615
species                        7      3743     0       0.0            504         35.8          46.8   1416
avg/sum (all but unassigned)   5.0    7165     0       0.0            159.0       69.5          36.1   351.6
avg/sum (all with unassigned)  5.0    7175     0       0.0            139.3       73.3          31.6   307.8

description              sum true  sum false  overall prec.
root+superkingdom        170       0          100.0
phylum+class+order       813       0          100.0
family+genus+species     6272      0          100.0
all but unassigned                            100.0
all with unassigned                           100.0

(c) new species scenario

rank                           depth  true     false   unknown  macro prec.  stdev  pred. bins  macro recall  stdev  real bins
unassigned                     0      22       0       0        100.0        0.0    1           100.0         0.0    1
superkingdom                   1      313      0       0        100.0        0.0    2           96.7          3.2    2
phylum                         2      347      2       0        100.0        0.0    14          54.8          39.5   32
class                          3      989      8       0        98.9         3.9    26          53.9          41.6   52
order                          4      948      9       0        98.8         7.9    54          44.2          39.6   109
family                         5      1350     18      0        98.2         7.3    91          28.8          37.5   235
genus                          6      2678     64      0        95.2         18.9   88          10.8          26.3   615
species                        7      0        427     0        0.0          0.0    54          0.0           0.0    1416
avg/sum (all but unassigned)   4.6    6625     528     0        84.4         5.4    47.0        41.3          26.8   351.6
avg/sum (all with unassigned)  4.6    6647     528     0        86.4         4.7    41.3        48.6          23.5   307.8

description              sum true  sum false  overall prec.
root+superkingdom        648       0          100.0
phylum+class+order       2284      19         99.2
family+genus+species     4028      509        88.8
all but unassigned                            92.6
all with unassigned                           92.6

(d) new genus scenario

rank                           depth  true     false   unknown  macro prec.  stdev  pred. bins  macro recall  stdev  real bins
unassigned                     0      48       0       0        100.0        0.0    1           100.0         0.0    1
superkingdom                   1      804      0       0        100.0        0.0    2           96.5          3.0    2
phylum                         2      1098     2       0        100.0        0.0    12          46.8          38.9   32
class                          3      2392     8       0        98.5         4.7    22          36.3          35.7   52
order                          4      1190     9       0        95.4         17.8   48          25.2          32.9   109
family                         5      1201     36      0        78.8         39.2   59          11.7          26.7   235
genus                          6      0        344     0        0.0          0.0    34          0.0           0.0    615
species                        7      0        43      0        0.0          0.0    8           0.0           0.0    1416
avg/sum (all but unassigned)   3.3    6685     442     0        67.5         8.8    26.4        30.9          19.6   351.6
avg/sum (all with unassigned)  3.3    6733     442     0        71.6         7.7    23.3        39.6          17.2   307.8

description              sum true  sum false  overall prec.
root+superkingdom        1656      0          100.0
phylum+class+order       4680      19         99.6
family+genus+species     1201      423        74.0
all but unassigned                            93.8
all with unassigned                           93.8
stdev
taxator-tk on RefSeq 16S genes
taxator-tk on RefSeq 16S genes
0
macro unknown (sequences) precision α=0.99 0 100.0 0 100.0 0 100.0 0 100.0 0 100.0 0 100.0 0 100.0 0 100.0 0 100.0 0 100.0
number of assigned sequences
unassigned superkingdom phylum class order family genus species avg/sum avg/sum
depth
100
6,000
90 5,000
80 70
4,000
60 50
3,000
40 2,000
30 20
1,000
10 0
unassigned superkingdom
taxonomic rank
phylum
class
order
taxonomic rank
taxator-tk Supplementary Material Page 15 of 51
family
genus
species
0
number of assigned sequences
rank
(b) all reference scenario
Supplementary Figure S3 - 16S gene assignment with taxator-tk
false (sequences) true (sequences) macro precision α=0.99 macro recall
(e) new family scenario
Supplementary Figure S3 - 16S gene assignment with taxator-tk true (sequences)
false (sequences)
0 1 2 3 4 5 6 7 2.4 2.4
299 1321 1442 3485 531 0 0 0 6779 7078
0 0 2 11 15 38 17 14 97 97
macro unknown (sequences) precision α=0.99 0 100.0 0 100.0 0 100.0 0 97.7 0 69.6 0 0.0 0 0.0 0 0.0 0 52.5 0 58.4
stdev
pred. bins
macro recall
stdev
real bins
0.0 0.0 0.0 7.2 42.6 0.0 0.0 0.0 7.1 6.2
1 2 7 13 28 24 9 3 12.3 10.9
100.0 82.8 25.4 15.0 3.4 0.0 0.0 0.0 18.1 28.3
0.0 13.7 35.2 26.2 12.1 0.0 0.0 0.0 12.5 10.9
1 2 32 52 109 235 615 1416 351.6 307.8
sum true (sequences)
sum false (sequences)
overall prec.
description
2941
0
100.0
root+superkingdom
5458
28
99.5
phylum+class+order
0
69
0.0
family+genus+species
98.6 98.6
all but unassigned all with unassigned
rank unassigned superkingdom phylum class order family genus species avg/sum avg/sum
depth
true (sequences)
false (sequences)
0 1 2 3 4 5 6 7 2.1 2.1
424 1920 1665 2631 0 0 0 0 6216 6640
0 0 2 12 434 74 2 11 535 535
6,000
90 5,000
80 70
4,000
60 50
3,000
40 2,000
30 20
1,000
10 unassigned superkingdom
phylum
class
order
family
genus
species
false (sequences) true (sequences) macro precision α=0.99 macro recall
0
% macro-precision and macro-recall
100
0
unassigned superkingdom phylum class order family genus species avg/sum avg/sum
depth
true (sequences)
false (sequences)
unknown (sequences)
0 1 2 3 4 5 6 7 1.2 1.2
549 4734 1419 0 0 0 0 0 6153 6702
0 0 67 390 3 9 1 3 473 473
0 0 0 0 0 0 0 0 0 0
(g) new class scenario stdev
pred. bins
macro recall
stdev
real bins
0.0 0.0 15.4 0.0 0.0 0.0 0.0 0.0 2.2 1.9
1 2 4 8 8 6 2 1 4.4 4.0
100.0 73.7 9.8 0.0 0.0 0.0 0.0 0.0 11.9 22.9
0.0 19.6 23.9 0.0 0.0 0.0 0.0 0.0 6.2 5.4
1 2 32 52 109 235 615 1416 351.6 307.8
sum true (sequences)
sum false (sequences)
overall prec.
description
10017
0
100.0
root+superkingdom
1419
460
75.5
phylum+class+order
0
13
0.0
family+genus+species
92.9 93.4
all but unassigned all with unassigned
70
4,000
60 50
3,000
40 2,000
30 20
1,000
10 class
order
1 2 32 52 109 235 615 1416 351.6 307.8
sum true (sequences)
sum false (sequences)
overall prec.
description
4264
0
100.0
root+superkingdom
4296
448
90.6
phylum+class+order
0
87
0.0
family+genus+species
92.1 92.5
all but unassigned all with unassigned
6,000 5,000
80 70
4,000
60 50
3,000
40 2,000
30 20
1,000
10 0
unassigned superkingdom
phylum
class
order
family
genus
species
false (sequences) true (sequences) macro precision α=0.99 macro recall
0
family
genus
(h) new phylum scenario
Supplementary Figure S3 - 16S gene assignment with taxator-tk rank
depth 0 1 2 3 4 5 6 7 1.1 1.1
unassigned superkingdom phylum class order family genus species avg/sum avg/sum
species
0
false (sequences) true (sequences) macro precision α=0.99 macro recall
% macro-precision and macro-recall
5,000
80
number of assigned sequences
% macro-precision and macro-recall
6,000
90
phylum
real bins
0.0 22.3 31.6 20.8 0.0 0.0 0.0 0.0 10.7 9.3
true (sequences)
false (sequences)
569 5945 0 0 0 0 0
0 0 391 219 16 8 27
5945 6514
661 661
macro unknown (sequences) precision stdev α=0.99 0 100.0 0.0 0 100.0 0.0 0 0.0 0.0 0 0.0 0.0 0 0.0 0.0 0 0.0 0.0 0 0.0 0.0 nan nan 0 16.7 0.0 0 28.6 0.0
pred. bins
macro recall
stdev
real bins
1 1 5 8 7 5 2 0 4.0 3.6
100.0 58.5 0.0 0.0 0.0 0.0 0.0 0.0 8.4 19.8
0.0 35.3 0.0 0.0 0.0 0.0 0.0 0.0 5.0 4.4
1 2 32 52 109 235 615 1416 351.6 307.8
sum true (sequences)
sum false (sequences)
overall prec.
description
12459
0
100.0
root+superkingdom
0
626
0.0
phylum+class+order
0
35
0.0
family+genus+species
90.0 90.8
all but unassigned all with unassigned
taxator-tk on RefSeq 16S genes
100
unassigned superkingdom
stdev
100.0 72.9 19.1 7.9 0.0 0.0 0.0 0.0 14.3 25.0
90
taxator-tk on RefSeq 16S genes
0
macro recall
1 2 6 8 17 11 3 1 6.9 6.1
taxonomic rank
Supplementary Figure S3 - 16S gene assignment with taxator-tk rank
pred. bins
0.0 0.0 0.1 1.1 0.0 0.0 0.0 0.0 0.2 0.1
100
taxonomic rank
macro precision α=0.99 100.0 100.0 90.3 0.0 0.0 0.0 0.0 0.0 27.2 36.3
stdev
taxator-tk on RefSeq 16S genes number of assigned sequences
% macro-precision and macro-recall
taxator-tk on RefSeq 16S genes
macro unknown (sequences) precision α=0.99 0 100.0 0 100.0 0 100.0 0 99.6 0 0.0 0 0.0 0 0.0 0 0.0 0 42.8 0 49.9
number of assigned sequences
unassigned superkingdom phylum class order family genus species avg/sum avg/sum
depth
100
6,000
90 5,000
80 70
4,000
60 50
3,000
40 2,000
30 20
1,000
10 0
unassigned superkingdom
taxonomic rank
phylum
class
order
taxonomic rank
taxator-tk Supplementary Material Page 16 of 51
family
genus
species
0
number of assigned sequences
rank
(f) new order scenario
Supplementary Figure S3 - 16S gene assignment with taxator-tk
false (sequences) true (sequences) macro precision α=0.99 macro recall
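The "overall prec." values printed with Supplementary Figure S3 appear consistent with the usual definition of precision, true / (true + false), expressed as a percentage. The following minimal sketch checks this against the counts of the new species scenario. The function name `overall_precision` and the zero-assignment guard are assumptions made for illustration and are not taken from taxator-tk.

```python
def overall_precision(true_count: float, false_count: float) -> float:
    """Percentage of assigned sequences that are correct.

    Returns 0.0 when nothing was assigned (a convention assumed here).
    """
    assigned = true_count + false_count
    if assigned == 0:
        return 0.0
    return 100.0 * true_count / assigned


# Counts from the new species scenario of Supplementary Figure S3.
fgs = round(overall_precision(4028, 509), 1)            # family+genus+species
no_unassigned = round(overall_precision(6625, 528), 1)  # all but unassigned
with_unassigned = round(overall_precision(6647, 528), 1)  # all with unassigned

print(fgs, no_unassigned, with_unassigned)  # 88.8 92.6 92.6
```

The three results match the 88.8, 92.6 and 92.6 reported for that scenario.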
Supplementary Figure S4: Taxonomic composition of microbial RefSeq47
Taxonomic composition, down to family level, of the microbial (bacterial, archaeal and viral) portion of the RefSeq47 sequence data collection, visualized with Krona (Ondov et al., 2011). An interactive version can be found in the supplementary files (RefSeq47.krona.html). Abundance is measured as the accumulated sequence length per clade.
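As an illustration of the abundance measure used in this figure, the sketch below accumulates sequence lengths per clade by adding each sequence's length to every ancestor in its lineage. The input format, the function name `accumulate_lengths` and the example lengths are hypothetical; this is not the procedure used to build the Krona chart.

```python
from collections import defaultdict


def accumulate_lengths(records):
    """Sum sequence lengths for every clade along each lineage.

    records: iterable of (lineage, length) pairs, where lineage is a tuple
    of taxon names ordered from the top rank to the most specific one.
    Returns a dict mapping each lineage prefix (a clade) to its total length.
    """
    totals = defaultdict(int)
    for lineage, length in records:
        # Every prefix of the lineage is an ancestor clade of the sequence.
        for depth in range(1, len(lineage) + 1):
            totals[lineage[:depth]] += length
    return dict(totals)


# Made-up example lengths, for illustration only.
records = [
    (("Bacteria", "Proteobacteria", "Gammaproteobacteria"), 4_600_000),
    (("Bacteria", "Firmicutes"), 2_800_000),
    (("Archaea", "Euryarchaeota"), 1_700_000),
]
totals = accumulate_lengths(records)
print(totals[("Bacteria",)])  # 7400000
```

Keying clades by their full lineage prefix, rather than by name alone, keeps identically named taxa in different parts of the taxonomy apart.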
Supplementary Figure S5: Taxonomic composition of 16S genes extracted from RefSeq47
Taxonomic composition, down to genus level, of the 16S benchmark dataset, visualized with Krona (Ondov et al., 2011). The dataset was simulated by extracting every annotated 16S gene in RefSeq47 that was at least 1000 bp long. An interactive version can be found in the supplementary files (refseq16S.krona.html). Abundance is measured as the number of 16S genes.
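The selection rule and abundance measure described in this caption can be sketched as follows: keep annotated 16S genes of at least 1000 bp and count the retained genes per genus. The tuple layout, the function name `count_16s_by_genus` and the example coordinates are assumptions for illustration; the actual extraction from RefSeq47 annotations is not shown here.

```python
from collections import Counter

MIN_LENGTH = 1000  # bp, the threshold stated in the caption


def count_16s_by_genus(genes, min_length=MIN_LENGTH):
    """Count 16S genes per genus, keeping only genes of at least min_length bp.

    genes: iterable of (genus, start, end) with 1-based inclusive coordinates.
    """
    counts = Counter()
    for genus, start, end in genes:
        length = end - start + 1
        if length >= min_length:
            counts[genus] += 1
    return counts


# Made-up annotations, for illustration only.
genes = [
    ("Escherichia", 1, 1542),
    ("Escherichia", 5000, 6541),
    ("Bacillus", 100, 1650),
    ("Bacillus", 9000, 9800),  # 801 bp fragment, dropped by the filter
]
counts = count_16s_by_genus(genes)
print(counts["Escherichia"], counts["Bacillus"])  # 2 1
```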
Supplementary Figure S6 - Simulated 100 bp sequence assignment with taxator-tk
[Figure: taxator-tk on simulated 100bp sequences. Panels: (a) summary scenario, (b) all reference scenario, (c) new species scenario, (d) new genus scenario, (e) new family scenario, (f) new order scenario, (g) new class scenario and (h) new phylum scenario. x-axis: taxonomic rank (unassigned, superkingdom, phylum, class, order, family, genus, species); y-axes: number of assigned sequences (true and false) and % macro-precision (α=0.99) and macro-recall. Each panel carries a per-rank table listing depth, true, false and unknown sequences, macro precision and macro recall with stdev, and pred. bins and real bins.]
Cells give sum true (sequences) / sum false (sequences) / overall prec. (%); the last column gives overall prec. (%) for all but unassigned / all with unassigned.
scenario            root+superkingdom          phylum+class+order         family+genus+species       overall prec.
(a) summary         101937.3 / 427.3 / 99.6    8977.7 / 4995.4 / 64.2     13520.1 / 2415.0 / 84.8    87.5 / 92.2
(b) all reference   47620 / 0 / 100.0          11598 / 0 / 100.0          59261 / 0 / 100.0          100.0 / 100.0
(c) new species     76901 / 252 / 99.7         13213 / 1541 / 89.6        31353 / 4031 / 88.6        92.5 / 94.2
(d) new genus       110459 / 343 / 99.7        14566 / 3057 / 82.7        4027 / 5323 / 43.1         86.6 / 91.3
(e) new family      120367 / 525 / 99.6        12720 / 3840 / 76.8        0 / 2624 / 0.0             88.3 / 93.0
(f) new order       121439 / 563 / 99.5        7788 / 6917 / 53.0         0 / 1994 / 0.0             83.1 / 90.5
(g) new class       122726 / 579 / 99.5        2959 / 8462 / 25.9         0 / 1562 / 0.0             78.7 / 89.4
(h) new phylum      114049 / 729 / 99.4        0 / 11151 / 0.0            0 / 1371 / 0.0             67.3 / 86.7
Supplementary Figure S7 - Simulated 500 bp sequence assignment with taxator-tk
[Figure: taxator-tk on simulated 500bp sequences. Panels: (a) summary scenario, (b) all reference scenario, (c) new species scenario, (d) new genus scenario, (e) new family scenario, (f) new order scenario, (g) new class scenario and (h) new phylum scenario. x-axis: taxonomic rank (unassigned, superkingdom, phylum, class, order, family, genus, species); y-axes: number of assigned sequences (true and false) and % macro-precision (α=0.99) and macro-recall. Each panel carries a per-rank table listing depth, true, false and unknown sequences, macro precision and macro recall with stdev, and pred. bins and real bins.]
Cells give sum true (sequences) / sum false (sequences) / overall prec. (%); the last column gives overall prec. (%) for all but unassigned / all with unassigned.
scenario            root+superkingdom          phylum+class+order         family+genus+species       overall prec.
(a) summary         99200.0 / 582.0 / 99.4     15065.4 / 7226.3 / 67.6    14666.0 / 2859.7 / 83.7    86.7 / 89.3
(b) all reference   42883 / 0 / 100.0          11750 / 0 / 100.0          62809 / 0 / 100.0          100.0 / 100.0
(c) new species     67276 / 224 / 99.7         19494 / 1591 / 92.5        34614 / 5179 / 87.0        92.2 / 93.0
(d) new genus       100784 / 357 / 99.6        27611 / 3340 / 89.2        5239 / 6119 / 46.1         88.6 / 90.2
(e) new family      114197 / 702 / 99.4        25162 / 4675 / 84.3        0 / 2997 / 0.0             89.7 / 91.6
(f) new order       121329 / 771 / 99.4        15178 / 10362 / 59.4       0 / 2084 / 0.0             83.1 / 86.8
(g) new class       127418 / 838 / 99.3        6263 / 12967 / 32.6        0 / 1838 / 0.0             78.0 / 84.4
(h) new phylum      120513 / 1182 / 99.0       0 / 17649 / 0.0            0 / 1801 / 0.0             66.6 / 79.4
Supplementary Figure S8 - Simulated 1000 bp sequence assignment with taxator-tk true (sequences)
false (sequences)
0 1 2 3 4 5 6 7 2.4 1.9
18217.3 37796.3 9465.1 4795.0 4417.0 3498.1 7834.3 3837.4 71643.3 89860.6
0.0 550.7 3300.7 1367.1 1966.7 817.1 1397.6 739.4 10139.4 10139.4
macro unknown (sequences) precision α=0.99 0 100.0 0 99.2 0 87.0 0 83.2 0 84.5 0 84.4 0 86.4 0 77.2 0 86.0 0 87.7
stdev
pred. bins
macro recall
stdev
real bins
0.0 0.0 12.2 14.7 15.1 17.9 19.6 34.2 16.2 14.2
1 1 12 25 57 106 219 472 127.4 111.6
100.0 38.1 15.2 13.4 10.8 7.5 4.3 1.5 13.0 23.8
0.0 33.8 12.7 10.3 9.2 8.1 6.3 3.5 12.0 10.5
1 3 32 52 110 240 653 1690 397.1 347.6
sum true (sequences)
sum false (sequences)
overall prec.
description
93809.9
550.7
99.4
root+superkingdom
18677.1
6634.6
73.8
phylum+class+order
15169.9
2954.1
83.7
family+genus+species
87.6 89.9
all but unassigned all with unassigned
rank unassigned superkingdom phylum class order family genus species avg/sum avg/sum
depth
true (sequences)
false (sequences)
0 1 2 3 4 5 6 7 4.0 3.6
7256 16867 4739 3024 4832 8132 28288 26862 92744 100000
0 0 0 0 0 0 0 0 0 0
40,000
70 60
30,000
50 40
20,000
30 20
10,000
false (sequences) true (sequences) macro precision α=0.99 macro recall
10 0
unassigned superkingdom
phylum
Supplementary Figure S8 - Simulated 1000 bp sequence assignment with taxator-tk
Panels: (a) summary scenario, (b) all reference scenario, (c) new species scenario, (d) new genus scenario.
Plot per panel, titled "taxator-tk on simulated 1000bp sequences": x axis taxonomic rank; y axes number of assigned sequences (series: false, true) and % macro-precision and macro-recall (series: macro precision α=0.99, macro recall).
rank: unassigned superkingdom phylum class order family genus species avg/sum avg/sum

(b) all reference scenario
unknown (sequences): 0 0 0 0 0 0 0 0 0 0
macro precision α=0.99: 100.0 100.0 100.0 100.0 100.0 100.0 100.0 100.0 100.0 100.0
stdev: 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
pred. bins: 1 2 14 27 59 112 221 402 119.6 104.8
macro recall: 100.0 55.1 45.2 44.7 40.6 33.1 23.3 10.1 36.0 44.0
stdev: 0.0 39.9 30.5 29.7 31.1 32.0 32.2 24.8 31.5 27.5
real bins: 1 3 32 52 110 240 653 1690 397.1 347.6
summary (sum true, sum false, overall prec., description):
40990 0 100.0 root+superkingdom
12595 0 100.0 phylum+class+order
63282 0 100.0 family+genus+species
overall prec.: 100.0 (all but unassigned), 100.0 (all with unassigned)

(c) new species scenario
depth: 0 1 2 3 4 5 6 7 3.5 3.3
true (sequences): 7557 26333 9128 6028 7878 9803 26552 0 85722 93279
false (sequences): 0 191 523 250 369 325 898 4165 6721 6721
unknown (sequences): 0 0 0 0 0 0 0 0 0 0
macro precision α=0.99: 100.0 99.5 99.2 98.4 97.8 96.3 90.5 0.0 83.1 85.2
stdev: 0.0 0.2 0.6 2.4 4.4 9.0 22.4 0.0 5.6 4.9
pred. bins: 1 2 12 24 52 83 107 230 72.9 63.9
macro recall: 100.0 49.7 28.6 28.3 23.0 15.4 6.5 0.0 21.6 31.4
stdev: 0.0 38.2 25.3 25.2 23.6 22.6 16.7 0.0 21.7 19.0
real bins: 1 3 32 52 110 240 653 1690 397.1 347.6
summary (sum true, sum false, overall prec., description):
60223 191 99.7 root+superkingdom
23034 1142 95.3 phylum+class+order
36355 5388 87.1 family+genus+species
overall prec.: 92.7 (all but unassigned), 93.3 (all with unassigned)

(d) new genus scenario
depth: 0 1 2 3 4 5 6 7 2.4 2.2
true (sequences): 9974 39636 14673 9782 10445 6552 0 0 81088 91062
false (sequences): 0 293 1006 538 874 955 4978 294 8938 8938
unknown (sequences): 0 0 0 0 0 0 0 0 0 0
macro precision α=0.99: 100.0 98.8 94.6 90.4 90.2 59.0 0.0 0.0 61.9 66.6
stdev: 0.0 0.8 9.8 11.6 10.8 39.9 0.0 0.0 10.4 9.1
pred. bins: 1 2 10 22 47 82 143 94 57.1 50.1
macro recall: 100.0 44.0 17.7 14.2 9.7 3.7 0.0 0.0 12.8 23.7
stdev: 0.0 37.1 18.6 15.4 13.5 9.4 0.0 0.0 13.4 11.7
real bins: 1 3 32 52 110 240 653 1690 397.1 347.6
summary (sum true, sum false, overall prec., description):
89246 293 99.7 root+superkingdom
34900 2418 93.5 phylum+class+order
6552 6227 51.3 family+genus+species
overall prec.: 90.1 (all but unassigned), 91.1 (all with unassigned)
Supplementary Figure S8 - Simulated 1000 bp sequence assignment with taxator-tk
Plot per panel, titled "taxator-tk on simulated 1000bp sequences": x axis taxonomic rank; y axes number of assigned sequences (series: false, true) and % macro-precision and macro-recall (series: macro precision α=0.99, macro recall).
rank: unassigned superkingdom phylum class order family genus species avg/sum avg/sum

(e) new family scenario
depth: 0 1 2 3 4 5 6 7 1.9 1.6
true (sequences): 15429 44103 15330 9698 7764 0 0 0 76895 92324
false (sequences): 0 644 1458 849 1404 1548 1507 266 7676 7676
unknown (sequences): 0 0 0 0 0 0 0 0 0 0
macro precision α=0.99: 100.0 99.1 94.1 79.4 45.8 0.0 0.0 0.0 45.5 52.3
stdev: 0.0 0.0 2.4 24.5 38.6 0.0 0.0 0.0 9.4 8.2
pred. bins: 1 1 6 15 40 104 127 65 51.1 44.9
macro recall: 100.0 35.3 8.0 5.4 2.1 0.0 0.0 0.0 7.3 18.9
stdev: 0.0 36.3 13.7 9.1 5.6 0.0 0.0 0.0 9.3 8.1
real bins: 1 3 32 52 110 240 653 1690 397.1 347.6
summary (sum true, sum false, overall prec., description):
103635 644 99.4 root+superkingdom
32792 3711 89.8 phylum+class+order
0 3321 0.0 family+genus+species
overall prec.: 90.9 (all but unassigned), 92.3 (all with unassigned)

(f) new order scenario
depth: 0 1 2 3 4 5 6 7 1.7 1.3
true (sequences): 19932 47714 14366 5033 0 0 0 0 67113 87045
false (sequences): 0 725 4764 1596 3635 1133 883 219 12955 12955
unknown (sequences): 0 0 0 0 0 0 0 0 0 0
macro precision α=0.99: 100.0 98.9 81.4 40.2 0.0 0.0 0.0 0.0 31.5 40.1
stdev: 0.0 0.0 11.9 33.9 0.0 0.0 0.0 0.0 6.5 5.7
pred. bins: 1 1 6 17 49 80 98 50 43.0 37.8
macro recall: 100.0 32.4 5.0 1.5 0.0 0.0 0.0 0.0 5.6 17.4
stdev: 0.0 34.7 10.2 3.5 0.0 0.0 0.0 0.0 6.9 6.0
real bins: 1 3 32 52 110 240 653 1690 397.1 347.6
summary (sum true, sum false, overall prec., description):
115360 725 99.4 root+superkingdom
19399 9995 66.0 phylum+class+order
0 2235 0.0 family+genus+species
overall prec.: 83.8 (all but unassigned), 87.0 (all with unassigned)

(g) new class scenario
depth: 0 1 2 3 4 5 6 7 1.5 1.1
true (sequences): 28057 48587 8020 0 0 0 0 0 56607 84664
false (sequences): 0 824 6696 3649 2300 939 835 93 15336 15336
unknown (sequences): 0 0 0 0 0 0 0 0 0 0
macro precision α=0.99: 100.0 98.6 39.8 0.0 0.0 0.0 0.0 0.0 19.8 29.8
stdev: 0.0 0.0 34.0 0.0 0.0 0.0 0.0 0.0 4.9 4.2
pred. bins: 1 1 7 22 48 94 94 38 43.4 38.1
macro recall: 100.0 28.5 2.2 0.0 0.0 0.0 0.0 0.0 4.4 16.3
stdev: 0.0 31.4 5.1 0.0 0.0 0.0 0.0 0.0 5.2 4.6
real bins: 1 3 32 52 110 240 653 1690 397.1 347.6
summary (sum true, sum false, overall prec., description):
125231 824 99.3 root+superkingdom
8020 12645 38.8 phylum+class+order
0 1867 0.0 family+genus+species
overall prec.: 78.7 (all but unassigned), 84.7 (all with unassigned)

(h) new phylum scenario
depth: 0 1 2 3 4 5 6 7 1.6 1.0
true (sequences): 39316 41334 0 0 0 0 0 0 41334 80650
false (sequences): 0 1178 8658 2688 5185 820 682 139 19350 19350
unknown (sequences): 0 0 0 0 0 0 0 0 0 0
macro precision α=0.99: 100.0 97.7 0.0 0.0 0.0 0.0 0.0 0.0 14.0 24.7
stdev: 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
pred. bins: 1 1 12 21 44 98 99 41 45.1 39.6
macro recall: 100.0 21.9 0.0 0.0 0.0 0.0 0.0 0.0 3.1 15.2
stdev: 0.0 27.3 0.0 0.0 0.0 0.0 0.0 0.0 3.9 3.4
real bins: 1 3 32 52 110 240 653 1690 397.1 347.6
summary (sum true, sum false, overall prec., description):
121984 1178 99.0 root+superkingdom
0 16531 0.0 phylum+class+order
0 1641 0.0 family+genus+species
overall prec.: 68.1 (all but unassigned), 80.7 (all with unassigned)
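The summary values reported for the Supplementary Figure S8 panels can be related to the per-rank counts. The following minimal sketch recomputes them for the (g) new class scenario. The helper names are illustrative and not part of taxator-tk, and the formulas are an assumption that reproduces the reported values for this panel (overall precision 78.7 and 84.7, macro precision averages 19.8 and 29.8).

```python
# Per-rank values of the (g) new class scenario in Supplementary Figure S8,
# ordered: unassigned, superkingdom, phylum, class, order, family, genus, species.
true_counts = [28057, 48587, 8020, 0, 0, 0, 0, 0]
false_counts = [0, 824, 6696, 3649, 2300, 939, 835, 93]
macro_precision = [100.0, 98.6, 39.8, 0.0, 0.0, 0.0, 0.0, 0.0]


def overall_precision(true_total, false_total):
    """Percentage of assigned sequences that are correct (assumed definition)."""
    assigned = true_total + false_total
    return 100.0 * true_total / assigned if assigned else 0.0


def rank_average(values, with_unassigned):
    """Plain mean over the ranks, optionally skipping the 'unassigned' column."""
    selected = values if with_unassigned else values[1:]
    return sum(selected) / len(selected)


# "all but unassigned": drop the first column; "all with unassigned": keep it.
prec_without = overall_precision(sum(true_counts[1:]), sum(false_counts[1:]))
prec_with = overall_precision(sum(true_counts), sum(false_counts))
avg_without = rank_average(macro_precision, with_unassigned=False)
avg_with = rank_average(macro_precision, with_unassigned=True)

print(round(prec_without, 1), round(prec_with, 1))
print(round(avg_without, 1), round(avg_with, 1))
```

The same two helpers apply unchanged to the bp-based counts of the later figures, since only the unit of the counts differs.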
Supplementary Figure S9: Taxonomic composition of microbial RefSeq54
Taxonomic composition, down to the family level, of the microbial (bacteria, archaea and viruses) portion of the RefSeq54 sequence data collection, visualized with Krona (Ondov et al., 2011). An interactive version can be found in the supplementary files (RefSeq54.krona.html). Abundance is measured in terms of accumulated sequence lengths per clade.
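Accumulating sequence lengths per clade, the abundance measure used for this chart, can be sketched as follows. The lineages and lengths are made-up toy values (not RefSeq54 data), the lineages are simplified to three levels, and the helper name is hypothetical.

```python
from collections import defaultdict


def accumulate_lengths(records):
    """Sum sequence lengths for every clade along each record's lineage."""
    totals = defaultdict(int)
    for lineage, length in records:
        # A sequence contributes its length to every ancestor clade,
        # i.e. to every prefix of its lineage.
        for depth in range(1, len(lineage) + 1):
            totals[lineage[:depth]] += length
    return dict(totals)


# Toy input: (lineage as superkingdom/phylum/family, sequence length in bp).
toy_records = [
    (("Bacteria", "Proteobacteria", "Enterobacteriaceae"), 4_600_000),
    (("Bacteria", "Proteobacteria", "Pseudomonadaceae"), 6_300_000),
    (("Bacteria", "Firmicutes", "Bacillaceae"), 4_200_000),
    (("Archaea", "Euryarchaeota", "Halobacteriaceae"), 2_000_000),
]

totals = accumulate_lengths(toy_records)
for clade, length in sorted(totals.items()):
    print("/".join(clade), length)
```

Keying the totals by lineage prefix keeps clades with the same name under different parents apart, which matters for a hierarchical chart.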
Supplementary Figure S10: Taxonomic composition of simArt49e
Taxonomic composition of the simulated metagenome sample simArt49e, visualized with Krona (Ondov et al., 2011). An interactive version can be found in the supplementary files (simArt49e.krona.html). Abundance is measured in terms of accumulated contig lengths. The reads for this dataset were simulated with equal coverage for every strain, so differences in the data proportions result from variable genome sizes and assembly bias.
Supplementary Figure S11 - CARMA binning of simulated metagenome with 49 species (simArt49e)
Plot per panel, titled "CARMA3 binning on simulated metagenome with 49 species": x axis taxonomic rank; y axes assigned sequences in bp (series: false, true) and % macro-precision and macro-recall (series: macro precision α=0.99, macro recall).
rank: unassigned superkingdom phylum class order family genus species avg/sum avg/sum

(a) summary scenario
depth: 0 1 2 3 4 5 6 7 4.2 2.2
true (bp): 97460.6 8776.0 5011.0 4568.4 8358.4 10303.4 17193.4 31789.6 86000.3 183460.9
false (bp): 0.0 500.1 7085.1 9153.1 12086.7 12858.1 13745.6 28288.3 83717.1 83717.1
unknown (bp): 0 0 0 0 0 0 0 0 0 0
macro precision α=0.99: 100.0 93.6 69.7 47.0 31.8 16.6 6.7 2.9 38.3 46.0
stdev: 0.0 3.0 22.7 38.5 39.5 33.6 24.1 16.5 25.4 22.2
pred. bins: 1 2 20 36 78 176 553 1672 362.4 317.3
macro recall: 100.0 64.1 36.9 33.4 29.2 24.7 19.3 11.8 31.3 39.9
stdev: 0.0 17.1 15.2 11.8 9.5 7.1 5.1 2.6 9.8 8.5
real bins: 1 2 20 23 32 36 41 49 29.0 25.5
summary (sum true, sum false, overall prec., description):
115012.6 500.1 99.6 root+superkingdom
17937.9 28325.0 38.8 phylum+class+order
59286.4 54892.0 51.9 family+genus+species
overall prec.: 50.7 (all but unassigned), 68.7 (all with unassigned)

(b) all reference scenario
depth: 0 1 2 3 4 5 6 7 5.9 5.8
true (bp): 1071 166 130 108 614 1000 39774 222527 264319 265390
false (bp): 0 0 1 0 0 0 30 1757 1788 1788
unknown (bp): 0 0 0 0 0 0 0 0 0 0
macro precision α=0.99: 100.0 100.0 100.0 100.0 100.0 100.0 100.0 99.5 99.9 99.9
stdev: 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2.7 0.4 0.3
pred. bins: 1 2 19 23 31 35 40 46 28.0 24.6
macro recall: 100.0 99.9 99.9 99.9 99.9 99.9 99.8 82.6 97.4 97.7
stdev: 0.0 0.1 0.2 0.2 0.2 0.2 0.2 18.3 2.8 2.4
real bins: 1 2 20 23 32 36 41 49 29.0 25.5
summary (sum true, sum false, overall prec., description):
1403 0 100.0 root+superkingdom
852 1 99.9 phylum+class+order
263301 1787 99.3 family+genus+species
overall prec.: 99.3 (all but unassigned), 99.3 (all with unassigned)

(c) new species scenario
depth: 0 1 2 3 4 5 6 7 5.1 4.0
true (bp): 48039 3734 5195 6657 17364 43855 80580 0 157385 205424
false (bp): 0 174 1718 2267 3022 4055 10693 39825 61754 61754
unknown (bp): 0 0 0 0 0 0 0 0 0 0
macro precision α=0.99: 100.0 98.9 93.8 81.4 56.4 32.0 12.5 0.0 53.6 59.4
stdev: 0.0 0.2 5.8 31.7 46.1 44.7 32.0 0.0 22.9 20.0
pred. bins: 1 2 17 24 48 96 216 1153 222.3 194.6
macro recall: 100.0 82.3 65.6 63.8 59.4 53.0 35.0 0.0 51.3 57.4
stdev: 0.0 8.3 31.8 28.3 32.1 33.4 35.5 0.0 24.2 21.2
real bins: 1 2 20 23 32 36 41 49 29.0 25.5
summary (sum true, sum false, overall prec., description):
55507 174 99.7 root+superkingdom
29216 7007 80.7 phylum+class+order
124435 54573 69.5 family+genus+species
overall prec.: 71.8 (all but unassigned), 76.9 (all with unassigned)

(d) new genus scenario
depth: 0 1 2 3 4 5 6 7 4.3 2.5
true (bp): 101939 7388 7629 8751 20074 27269 0 0 71111 173050
false (bp): 0 386 4042 5633 8920 13450 33904 27793 94128 94128
unknown (bp): 0 0 0 0 0 0 0 0 0 0
macro precision α=0.99: 100.0 96.1 78.4 53.1 33.8 12.5 0.0 0.0 39.1 46.7
stdev: 0.0 1.9 17.6 39.5 39.9 29.1 0.0 0.0 18.3 16.0
pred. bins: 1 2 17 31 65 156 535 1788 370.6 324.4
macro recall: 100.0 67.2 40.7 38.3 32.7 20.1 0.0 0.0 28.4 37.4
stdev: 0.0 14.9 29.4 26.6 27.5 23.1 0.0 0.0 17.4 15.2
real bins: 1 2 20 23 32 36 41 49 29.0 25.5
summary (sum true, sum false, overall prec., description):
116715 386 99.7 root+superkingdom
36454 18595 66.2 phylum+class+order
27269 75147 26.6 family+genus+species
overall prec.: 43.0 (all but unassigned), 64.8 (all with unassigned)
Supplementary Figure S11 - CARMA binning of simulated metagenome with 49 species (simArt49e)
Plot per panel, titled "CARMA3 binning on simulated metagenome with 49 species": x axis taxonomic rank; y axes assigned sequences in bp (series: false, true) and % macro-precision and macro-recall (series: macro precision α=0.99, macro recall).
rank: unassigned superkingdom phylum class order family genus species avg/sum avg/sum

(e) new family scenario
depth: 0 1 2 3 4 5 6 7 3.8 2.0
true (bp): 114225 11362 9527 9232 20457 0 0 0 50578 164803
false (bp): 0 536 6860 8904 13293 24317 18709 29756 102375 102375
unknown (bp): 0 0 0 0 0 0 0 0 0 0
macro precision α=0.99: 100.0 92.1 48.3 23.8 9.7 0.0 0.0 0.0 24.8 34.2
stdev: 0.0 3.8 32.1 32.6 22.8 0.0 0.0 0.0 13.0 11.4
pred. bins: 1 2 18 36 81 196 625 1816 396.3 346.9
macro recall: 100.0 57.6 26.3 21.8 12.7 0.0 0.0 0.0 16.9 27.3
stdev: 0.0 21.4 27.3 25.3 19.9 0.0 0.0 0.0 13.4 11.7
real bins: 1 2 20 23 32 36 41 49 29.0 25.5
summary (sum true, sum false, overall prec., description):
136949 536 99.6 root+superkingdom
39216 29057 57.4 phylum+class+order
0 72782 0.0 family+genus+species
overall prec.: 33.1 (all but unassigned), 61.7 (all with unassigned)

(f) new order scenario
depth: 0 1 2 3 4 5 6 7 3.5 1.6
true (bp): 130123 12342 8019 7231 0 0 0 0 27592 157715
false (bp): 0 706 10411 14435 20646 18233 12779 32253 109463 109463
unknown (bp): 0 0 0 0 0 0 0 0 0 0
macro precision α=0.99: 100.0 87.7 27.9 10.0 0.0 0.0 0.0 0.0 17.9 28.2
stdev: 0.0 7.2 30.5 21.5 0.0 0.0 0.0 0.0 8.5 7.4
pred. bins: 1 2 21 39 90 203 652 1810 402.4 352.3
macro recall: 100.0 52.3 17.0 10.1 0.0 0.0 0.0 0.0 11.4 22.4
stdev: 0.0 23.4 23.2 16.8 0.0 0.0 0.0 0.0 9.0 7.9
real bins: 1 2 20 23 32 36 41 49 29.0 25.5
summary (sum true, sum false, overall prec., description):
154807 706 99.5 root+superkingdom
15250 45492 25.1 phylum+class+order
0 63265 0.0 family+genus+species
overall prec.: 20.1 (all but unassigned), 59.0 (all with unassigned)

(g) new class scenario
depth: 0 1 2 3 4 5 6 7 3.4 1.4
true (bp): 139175 13051 4577 0 0 0 0 0 17628 156803
false (bp): 0 778 12085 18605 19608 15771 10539 32989 110375 110375
unknown (bp): 0 0 0 0 0 0 0 0 0 0
macro precision α=0.99: 100.0 84.3 12.9 0.0 0.0 0.0 0.0 0.0 13.9 24.6
stdev: 0.0 9.8 20.1 0.0 0.0 0.0 0.0 0.0 4.3 3.7
pred. bins: 1 2 22 41 91 206 657 1814 404.7 354.3
macro recall: 100.0 48.2 8.9 0.0 0.0 0.0 0.0 0.0 8.2 19.6
stdev: 0.0 24.2 15.4 0.0 0.0 0.0 0.0 0.0 5.7 5.0
real bins: 1 2 20 23 32 36 41 49 29.0 25.5
summary (sum true, sum false, overall prec., description):
165277 778 99.5 root+superkingdom
4577 50298 8.3 phylum+class+order
0 59299 0.0 family+genus+species
overall prec.: 13.8 (all but unassigned), 58.7 (all with unassigned)

(h) new phylum scenario
depth: 0 1 2 3 4 5 6 7 3.4 1.2
true (bp): 147652 13389 0 0 0 0 0 0 13389 161041
false (bp): 0 921 14479 14228 19118 14181 9565 33645 106137 106137
unknown (bp): 0 0 0 0 0 0 0 0 0 0
macro precision α=0.99: 100.0 75.1 0.0 0.0 0.0 0.0 0.0 0.0 10.7 21.9
stdev: 0.0 17.3 0.0 0.0 0.0 0.0 0.0 0.0 2.5 2.2
pred. bins: 1 2 24 42 93 214 664 1820 408.4 357.5
macro recall: 100.0 41.1 0.0 0.0 0.0 0.0 0.0 0.0 5.9 17.6
stdev: 0.0 27.2 0.0 0.0 0.0 0.0 0.0 0.0 3.9 3.4
real bins: 1 2 20 23 32 36 41 49 29.0 25.5
summary (sum true, sum false, overall prec., description):
174430 921 99.5 root+superkingdom
0 47825 0.0 phylum+class+order
0 57391 0.0 family+genus+species
overall prec.: 11.2 (all but unassigned), 60.3 (all with unassigned)
Supplementary Figure S12 - MEGAN binning of simulated metagenome with 49 species (simArt49e)
Plot per panel, titled "MEGAN binning of simulated metagenome with 49 species": x axis taxonomic rank; y axes assigned sequences in bp (series: false, true) and % macro-precision and macro-recall (series: macro precision α=0.99, macro recall).
rank: unassigned superkingdom phylum class order family genus species avg/sum avg/sum

(a) summary scenario
depth: 0 1 2 3 4 5 6 7 2.4 1.7
true (bp): 62255.4 85269.4 5415.0 3302.9 6038.6 6638.4 18552.9 27832.6 153049.7 215305.1
false (bp): 0.0 8388.1 3937.6 1523.4 1545.7 2415.9 6525.4 27536.7 51872.9 51872.9
unknown (bp): 0 0 0 0 0 0 0 0 0 0
macro precision α=0.99: 100.0 97.8 89.4 61.7 43.3 22.4 9.3 5.4 47.0 53.7
stdev: 0.0 0.1 8.3 41.3 45.1 38.7 27.9 21.9 26.2 22.9
pred. bins: 1 2 19 33 66 139 400 824 211.9 185.5
macro recall: 100.0 64.8 35.9 34.3 32.3 28.0 21.2 11.8 32.6 41.0
stdev: 0.0 21.3 14.3 9.9 9.3 8.0 5.9 4.5 10.4 9.1
real bins: 1 2 20 23 32 36 41 49 29.0 25.5
summary (sum true, sum false, overall prec., description):
232794.3 8388.1 96.5 root+superkingdom
14756.4 7006.7 67.8 phylum+class+order
53023.9 36478.0 59.2 family+genus+species
overall prec.: 74.7 (all but unassigned), 80.6 (all with unassigned)

(b) all reference scenario
depth: 0 1 2 3 4 5 6 7 5.8 5.7
true (bp): 595 853 515 399 2034 5388 62555 194828 266572 267167
false (bp): 0 0 1 0 0 0 0 10 11 11
unknown (bp): 0 0 0 0 0 0 0 0 0 0
macro precision α=0.99: 100.0 100.0 100.0 100.0 100.0 100.0 100.0 100.0 100.0 100.0
stdev: 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
pred. bins: 1 2 19 23 31 35 40 44 27.7 24.4
macro recall: 100.0 99.9 99.9 99.9 99.9 99.8 99.3 82.5 97.3 97.7
stdev: 0.0 0.1 0.1 0.1 0.1 0.3 1.6 31.3 4.8 4.2
real bins: 1 2 20 23 32 36 41 49 29.0 25.5
summary (sum true, sum false, overall prec., description):
2301 0 100.0 root+superkingdom
2948 1 100.0 phylum+class+order
262771 10 100.0 family+genus+species
overall prec.: 100.0 (all but unassigned), 100.0 (all with unassigned)

(c) new species scenario
depth: 0 1 2 3 4 5 6 7 4.3 3.7
true (bp): 22828 34238 3671 3130 9167 20053 67315 0 137574 160402
false (bp): 0 3128 867 282 388 979 3069 98063 106776 106776
unknown (bp): 0 0 0 0 0 0 0 0 0 0
macro precision α=0.99: 100.0 99.6 98.4 98.4 83.5 58.9 22.9 0.0 66.0 70.2
stdev: 0.0 0.1 1.9 1.8 34.4 47.3 41.1 0.0 18.1 15.8
pred. bins: 1 2 17 21 35 55 121 218 67.0 58.8
macro recall: 100.0 87.1 67.3 70.8 70.5 65.3 49.0 0.0 58.6 63.8
stdev: 0.0 9.1 33.8 29.0 31.9 34.3 40.8 0.0 25.6 22.4
real bins: 1 2 20 23 32 36 41 49 29.0 25.5
summary (sum true, sum false, overall prec., description):
91304 3128 96.7 root+superkingdom
15968 1537 91.2 phylum+class+order
87368 102111 46.1 family+genus+species
overall prec.: 56.3 (all but unassigned), 60.0 (all with unassigned)

(d) new genus scenario
depth: 0 1 2 3 4 5 6 7 2.5 1.9
true (bp): 58109 89947 8288 6666 14535 21028 0 0 140464 198573
false (bp): 0 6636 1861 657 997 3013 20343 35098 68605 68605
unknown (bp): 0 0 0 0 0 0 0 0 0 0
macro precision α=0.99: 100.0 98.8 93.1 76.6 52.0 22.3 0.0 0.0 49.0 55.3
stdev: 0.0 0.0 9.2 34.7 45.1 38.5 0.0 0.0 18.2 15.9
pred. bins: 1 2 15 25 50 105 274 430 128.7 112.8
macro recall: 100.0 74.5 40.0 41.8 41.1 30.6 0.0 0.0 32.6 41.0
stdev: 0.0 15.4 29.9 26.2 29.8 30.6 0.0 0.0 18.8 16.5
real bins: 1 2 20 23 32 36 41 49 29.0 25.5
summary (sum true, sum false, overall prec., description):
238003 6636 97.3 root+superkingdom
29489 3515 89.3 phylum+class+order
21028 58454 26.5 family+genus+species
overall prec.: 67.2 (all but unassigned), 74.3 (all with unassigned)
Supplementary Figure S12 - MEGAN binning of simulated metagenome with 49 species (simArt49e)
Plot per panel, titled "MEGAN binning of simulated metagenome with 49 species": x axis taxonomic rank; y axes assigned sequences in bp (series: false, true) and % macro-precision and macro-recall (series: macro precision α=0.99, macro recall).
rank: unassigned superkingdom phylum class order family genus species avg/sum avg/sum

(e) new family scenario
depth: 0 1 2 3 4 5 6 7 1.9 1.3
true (bp): 73809 111523 10028 7514 16534 0 0 0 145599 219408
false (bp): 0 8005 2966 1043 1481 5906 8487 19882 47770 47770
unknown (bp): 0 0 0 0 0 0 0 0 0 0
macro precision α=0.99: 100.0 97.2 78.2 37.5 17.5 0.0 0.0 0.0 32.9 41.3
stdev: 0.0 0.4 27.3 42.3 32.9 0.0 0.0 0.0 14.7 12.9
pred. bins: 1 2 14 31 69 161 366 580 174.7 153.0
macro recall: 100.0 56.5 23.5 20.4 14.9 0.0 0.0 0.0 16.5 26.9
stdev: 0.0 28.9 25.8 21.9 21.1 0.0 0.0 0.0 14.0 12.2
real bins: 1 2 20 23 32 36 41 49 29.0 25.5
summary (sum true, sum false, overall prec., description):
296855 8005 97.4 root+superkingdom
34076 5490 86.1 phylum+class+order
0 34275 0.0 family+genus+species
overall prec.: 75.3 (all but unassigned), 82.1 (all with unassigned)

(f) new order scenario
depth: 0 1 2 3 4 5 6 7 1.5 1.0
true (bp): 89682 118597 10011 5411 0 0 0 0 134019 223701
false (bp): 0 11005 5581 1881 2446 3416 5382 13766 43477 43477
unknown (bp): 0 0 0 0 0 0 0 0 0 0
macro precision α=0.99: 100.0 95.5 47.4 14.4 0.0 0.0 0.0 0.0 22.5 32.2
stdev: 0.0 1.3 37.8 28.3 0.0 0.0 0.0 0.0 9.6 8.4
pred. bins: 1 2 17 39 84 167 401 565 182.1 159.5
macro recall: 100.0 49.8 13.8 7.0 0.0 0.0 0.0 0.0 10.1 21.3
stdev: 0.0 30.9 21.7 11.7 0.0 0.0 0.0 0.0 9.2 8.0
real bins: 1 2 20 23 32 36 41 49 29.0 25.5
summary (sum true, sum false, overall prec., description):
326876 11005 96.7 root+superkingdom
15422 9908 60.9 phylum+class+order
0 22564 0.0 family+genus+species
overall prec.: 75.5 (all but unassigned), 83.7 (all with unassigned)

(g) new class scenario
depth: 0 1 2 3 4 5 6 7 1.4 0.9
true (bp): 94817 122665 5392 0 0 0 0 0 128057 222874
false (bp): 0 11467 7208 4356 2145 2203 4437 12488 44304 44304
unknown (bp): 0 0 0 0 0 0 0 0 0 0
macro precision α=0.99: 100.0 93.2 25.1 0.0 0.0 0.0 0.0 0.0 16.9 27.3
stdev: 0.0 2.8 30.7 0.0 0.0 0.0 0.0 0.0 4.8 4.2
pred. bins: 1 2 19 43 88 172 446 657 203.9 178.5
macro recall: 100.0 45.9 6.4 0.0 0.0 0.0 0.0 0.0 7.5 19.0
stdev: 0.0 31.8 11.6 0.0 0.0 0.0 0.0 0.0 6.2 5.4
real bins: 1 2 20 23 32 36 41 49 29.0 25.5
summary (sum true, sum false, overall prec., description):
340147 11467 96.7 root+superkingdom
5392 13709 28.2 phylum+class+order
0 19128 0.0 family+genus+species
overall prec.: 74.3 (all but unassigned), 83.4 (all with unassigned)

(h) new phylum scenario
depth: 0 1 2 3 4 5 6 7 1.3 0.8
true (bp): 95948 119063 0 0 0 0 0 0 119063 215011
false (bp): 0 18476 9079 2445 3363 1394 3960 13450 52167 52167
unknown (bp): 0 0 0 0 0 0 0 0 0 0
macro precision α=0.99: 100.0 84.6 0.0 0.0 0.0 0.0 0.0 0.0 12.1 23.1
stdev: 0.0 8.8 0.0 0.0 0.0 0.0 0.0 0.0 1.3 1.1
pred. bins: 1 2 25 45 96 197 494 814 239.0 209.3
macro recall: 100.0 39.5 0.0 0.0 0.0 0.0 0.0 0.0 5.6 17.4
stdev: 0.0 33.1 0.0 0.0 0.0 0.0 0.0 0.0 4.7 4.1
real bins: 1 2 20 23 32 36 41 49 29.0 25.5
summary (sum true, sum false, overall prec., description):
334074 18476 94.8 root+superkingdom
0 14887 0.0 phylum+class+order
0 18804 0.0 family+genus+species
overall prec.: 69.5 (all but unassigned), 80.5 (all with unassigned)
Supplementary Figure S13 - Taxator-tk binning of simulated metagenome with 49 species (simArt49e)
Plot per panel, titled "taxator-tk binning of simulated metagenome with 49 species": x axis taxonomic rank; y axes assigned sequences in bp (series: false, true) and % macro-precision and macro-recall (series: macro precision α=0.99, macro recall).
rank: unassigned superkingdom phylum class order family genus species avg/sum avg/sum

(a) summary scenario
depth: 0 1 2 3 4 5 6 7 2.1 1.5
true (bp): 75644.4 106494.7 7691.6 3656.6 5832.6 7550.6 20271.9 11952.3 163450.1 239094.6
false (bp): 0.0 10293.3 9493.1 2344.0 2014.3 1079.7 1397.1 1461.9 28083.4 28083.4
unknown (bp): 0 0 0 0 0 0 0 0 0 0
macro precision α=0.99: 100.0 96.9 94.9 91.2 85.9 76.4 65.9 61.1 81.8 84.0
stdev: 0.0 2.5 9.2 21.5 31.8 39.8 46.4 47.2 28.3 24.8
pred. bins: 1 2 16 21 34 44 58 65 34.3 30.1
macro recall: 100.0 56.8 18.2 18.3 16.2 13.8 9.4 2.5 19.3 29.4
stdev: 0.0 33.5 13.5 11.6 9.5 8.2 7.7 4.4 12.6 11.0
real bins: 1 2 20 23 32 36 41 49 29.0 25.5
summary (sum true, sum false, overall prec., description):
288633.9 10293.3 96.6 root+superkingdom
17180.7 13851.4 55.4 phylum+class+order
39774.7 3938.7 91.0 family+genus+species
overall prec.: 85.3 (all but unassigned), 89.5 (all with unassigned)

(b) all reference scenario
depth: 0 1 2 3 4 5 6 7 4.4 3.6
true (bp): 34453 35661 2897 1947 8254 19632 80667 83666 232724 267177
false (bp): 0 0 0 0 0 0 0 1 1 1
unknown (bp): 0 0 0 0 0 0 0 0 0 0
macro precision α=0.99: 100.0 100.0 100.0 100.0 100.0 100.0 100.0 100.0 100.0 100.0
stdev: 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
pred. bins: 1 2 17 21 29 32 34 34 24.1 21.3
macro recall: 100.0 68.1 46.5 52.5 51.7 52.1 43.4 17.7 47.4 54.0
stdev: 0.0 8.4 28.4 29.1 30.6 31.5 34.0 30.5 27.5 24.1
real bins: 1 2 20 23 32 36 41 49 29.0 25.5
summary (sum true, sum false, overall prec., description):
105775 0 100.0 root+superkingdom
13098 0 100.0 phylum+class+order
183965 1 100.0 family+genus+species
overall prec.: 100.0 (all but unassigned), 100.0 (all with unassigned)

(c) new species scenario
depth: 0 1 2 3 4 5 6 7 3.4 2.6
true (bp): 63523 75519 8526 6516 14335 21470 61236 0 187602 251125
false (bp): 0 4247 1834 515 322 246 1365 7524 16053 16053
unknown (bp): 0 0 0 0 0 0 0 0 0 0
macro precision α=0.99: 100.0 99.1 99.4 99.7 99.7 99.7 88.7 0.0 83.8 85.8
stdev: 0.0 0.9 1.4 0.5 0.5 0.8 30.3 0.0 4.9 4.3
pred. bins: 1 2 15 18 25 28 26 48 23.1 20.4
macro recall: 100.0 74.7 39.3 42.8 39.9 34.9 22.6 0.0 36.3 44.3
stdev: 0.0 16.4 28.6 27.4 26.9 28.6 28.3 0.0 22.3 19.5
real bins: 1 2 20 23 32 36 41 49 29.0 25.5
summary (sum true, sum false, overall prec., description):
214561 4247 98.1 root+superkingdom
29377 2671 91.7 phylum+class+order
82706 9135 90.1 family+genus+species
overall prec.: 92.1 (all but unassigned), 94.0 (all with unassigned)

(d) new genus scenario
depth: 0 1 2 3 4 5 6 7 1.8 1.3
true (bp): 82737 117640 12404 6555 12254 11752 0 0 160605 243342
false (bp): 0 8730 4567 1439 1241 1508 4633 1718 23836 23836
unknown (bp): 0 0 0 0 0 0 0 0 0 0
macro precision α=0.99: 100.0 98.1 98.2 98.5 97.1 56.2 0.0 0.0 64.0 68.5
stdev: 0.0 1.4 3.3 1.9 4.9 46.1 0.0 0.0 8.2 7.2
pred. bins: 1 2 13 16 22 33 52 49 26.7 23.5
macro recall: 100.0 61.5 20.1 20.9 17.6 9.5 0.0 0.0 18.5 28.7
stdev: 0.0 31.7 21.3 21.8 20.2 16.7 0.0 0.0 16.0 14.0
real bins: 1 2 20 23 32 36 41 49 29.0 25.5
summary (sum true, sum false, overall prec., description):
318017 8730 97.3 root+superkingdom
31213 7247 81.2 phylum+class+order
11752 7859 59.9 family+genus+species
overall prec.: 87.1 (all but unassigned), 91.1 (all with unassigned)
Supplementary Figure S13 - Taxator-tk binning of simulated metagenome with 49 species (simArt49e) (continued)
[Plots not reproduced. Each panel, titled "taxator-tk binning of simulated metagenome with 49 species", shows assigned sequences in bp (true, false) and % macro-precision (α=0.99) and macro-recall against taxonomic rank.]

(e) new family scenario
rank unassigned superkingdom phylum class order family genus species avg/sum avg/sum
depth 0 1 2 3 4 5 6 7 1.4 1.0
true (bp) 86887 129189 13258 6610 5985 0 0 0 155042 241929
false (bp) 0 10849 6399 1866 1537 2768 1474 356 25249 25249
unknown (bp) 0 0 0 0 0 0 0 0 0 0
macro precision α=0.99 100.0 95.9 95.0 96.0 38.6 0.0 0.0 0.0 46.5 53.2
stdev 0.0 2.0 6.2 4.4 46.6 0.0 0.0 0.0 8.5 7.4
pred. bins 1 2 8 12 27 48 85 81 37.6 33.0
macro recall 100.0 50.8 12.1 8.9 4.3 0.0 0.0 0.0 10.9 22.0
stdev 0.0 42.7 19.5 14.0 8.3 0.0 0.0 0.0 12.1 10.6
real bins 1 2 20 23 32 36 41 49 29.0 25.5
sum true (bp)  sum false (bp)  overall prec.  description
345265  10849  97.0  root+superkingdom
25853  9802  72.5  phylum+class+order
0  4598  0.0  family+genus+species
overall prec.  86.0 (all but unassigned)  90.5 (all with unassigned)

(f) new order scenario
rank unassigned superkingdom phylum class order family genus species avg/sum avg/sum
depth 0 1 2 3 4 5 6 7 1.3 0.9
true (bp) 83674 131010 11454 3968 0 0 0 0 146432 230106
false (bp) 0 14172 14889 3071 2444 1364 901 231 37072 37072
unknown (bp) 0 0 0 0 0 0 0 0 0 0
macro precision α=0.99 100.0 92.7 73.6 36.7 0.0 0.0 0.0 0.0 29.0 37.9
stdev 0.0 0.0 25.5 40.1 0.0 0.0 0.0 0.0 9.4 8.2
pred. bins 1 1 6 16 38 70 103 83 45.3 39.8
macro recall 100.0 48.7 7.2 2.9 0.0 0.0 0.0 0.0 8.4 19.8
stdev 0.0 45.1 16.0 7.2 0.0 0.0 0.0 0.0 9.7 8.5
real bins 1 2 20 23 32 36 41 49 29.0 25.5
sum true (bp)  sum false (bp)  overall prec.  description
345694  14172  96.1  root+superkingdom
15422  20404  43.0  phylum+class+order
0  2496  0.0  family+genus+species
overall prec.  79.8 (all but unassigned)  86.1 (all with unassigned)

(g) new class scenario
rank unassigned superkingdom phylum class order family genus species avg/sum avg/sum
depth 0 1 2 3 4 5 6 7 1.3 0.8
true (bp) 88280 132003 5302 0 0 0 0 0 137305 225585
false (bp) 0 14601 17992 4839 2332 961 672 196 41593 41593
unknown (bp) 0 0 0 0 0 0 0 0 0 0
macro precision α=0.99 100.0 91.8 51.5 0.0 0.0 0.0 0.0 0.0 20.5 30.4
stdev 0.0 0.0 32.3 0.0 0.0 0.0 0.0 0.0 4.6 4.0
pred. bins 1 1 5 17 41 73 107 74 45.4 39.9
macro recall 100.0 47.9 2.3 0.0 0.0 0.0 0.0 0.0 7.2 18.8
stdev 0.0 44.8 5.5 0.0 0.0 0.0 0.0 0.0 7.2 6.3
real bins 1 2 20 23 32 36 41 49 29.0 25.5
sum true (bp)  sum false (bp)  overall prec.  description
352286  14601  96.0  root+superkingdom
5302  25163  17.4  phylum+class+order
0  1829  0.0  family+genus+species
overall prec.  76.8 (all but unassigned)  84.4 (all with unassigned)

(h) new phylum scenario
rank unassigned superkingdom phylum class order family genus species avg/sum avg/sum
depth 0 1 2 3 4 5 6 7 1.3 0.9
true (bp) 89957 124441 0 0 0 0 0 0 124441 214398
false (bp) 0 19454 20771 4678 6224 711 735 207 52780 52780
unknown (bp) 0 0 0 0 0 0 0 0 0 0
macro precision α=0.99 100.0 89.7 0.0 0.0 0.0 0.0 0.0 0.0 12.8 23.7
stdev 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
pred. bins 1 1 6 15 27 94 141 100 54.9 48.1
macro recall 100.0 45.6 0.0 0.0 0.0 0.0 0.0 0.0 6.5 18.2
stdev 0.0 45.0 0.0 0.0 0.0 0.0 0.0 0.0 6.4 5.6
real bins 1 2 20 23 32 36 41 49 29.0 25.5
sum true (bp)  sum false (bp)  overall prec.  description
338839  19454  94.6  root+superkingdom
0  31673  0.0  phylum+class+order
0  1653  0.0  family+genus+species
overall prec.  70.2 (all but unassigned)  80.2 (all with unassigned)
Supplementary Figure S14: Bin precision plots for 49 species simulated metagenomic sample (simArt49e)
Comparison of the assignment quality of CARMA3, MEGAN4 and taxator-tk for a simulated metagenome sample from a 49-species microbial community. Values are shown for the summary scenario (sum of all seven cross-validation scenarios), for assignments to the (a) species, (b) genus, (c) family, (d) order, (e) class and (f) phylum ranks, respectively. The first panel for each rank shows the precision and size of every predicted bin (after removing low-abundance bins). The colored line shows a smoothed k-nearest-neighbor estimate of the mean precision as a function of predicted bin size, computed with the R function wapply (width=0.3) followed by smooth.spline (df=10). The second panel for each rank shows bin precision relative to recall. The F-score partitioning helps to identify bins of similar quality if precision and recall are weighted equally; however, we consider precision more important than recall. The third panel shows the total true (blue), false (red) and unassigned (gray) portions of the assignments at the respective ranks. Note that a partially incorrect assignment is counted as incorrect at the low ranks where it is false and as correct at the higher ranks.
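The F-score mentioned in the caption combines the precision and recall of a bin into a single value. A minimal Python sketch is given below, assuming the standard F-beta definition; the function name `f_score` and the example values are hypothetical, and the beta=0.5 variant is shown only to illustrate what weighting precision more heavily could look like, not as the partitioning used in the figure.

```python
def f_score(precision: float, recall: float, beta: float = 1.0) -> float:
    """F-beta score of one bin; beta=1 weights precision and recall equally.

    Values of beta below 1 put more weight on precision.
    """
    if precision <= 0.0 and recall <= 0.0:
        return 0.0
    b2 = beta * beta
    return (1.0 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical bin with high precision but low recall.
equal_weight = f_score(0.9, 0.3)              # precision and recall weighted equally
precision_heavy = f_score(0.9, 0.3, beta=0.5)  # precision weighted more heavily
print(equal_weight, precision_heavy)
```

With equal weights, a bin with high precision and low recall falls into the same F-score band as a bin with the reverse profile, which is why the caption stresses that precision is considered the more important of the two.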
Supplementary Figure S15: Taxonomic composition of SimMC/AMD
Taxonomic composition of the FAMeS simulated metagenome sample SimMC/AMD, visualized with Krona (Ondov et al., 2011). An interactive version can be found in the supplementary files (SimMC.krona.html). Abundance is measured as accumulated contig length.
Supplementary Figure S16: Taxonomic composition of SimHC/soil
Taxonomic composition of the FAMeS simulated metagenome sample SimHC/soil, visualized with Krona (Ondov et al., 2011). An interactive version can be found in the supplementary files (SimHC.krona.html). Abundance is measured as accumulated contig length.
Supplementary Figure S17 - MEGAN binning for FAMeS SimMC
[Plots not reproduced. Each panel, titled "MEGAN binning for FAMeS SimMC", shows assigned sequences in kb (true, false) and % macro-precision (α=0.99) and macro-recall against taxonomic rank.]

(a) summary scenario
rank unassigned superkingdom phylum class order family genus species avg/sum avg/sum
depth 0 1 2 3 4 5 6 7 3.3 3.1
true (kb) 877.9 2428.7 2508.3 1611.6 484.1 1590.7 811.4 2332.3 11767.1 12645.0
false (kb) 0.0 7.5 60.0 389.1 646.1 617.3 1102.6 1572.8 4395.3 4395.3
unknown (kb) 0 0 0 0 0 0 0 0 0 0
macro precision α=0.99 100.0 100.0 18.7 14.8 9.8 6.1 3.9 3.0 22.3 32.0
stdev 0.0 0.0 32.6 29.4 23.8 21.5 18.0 16.5 20.2 17.7
pred. bins 1 1 8 17 39 69 131 188 64.7 56.8
macro recall 100.0 45.0 35.4 24.3 15.6 7.5 3.5 1.8 19.0 29.1
stdev 0.0 45.0 23.9 19.7 16.5 13.0 7.2 4.7 18.6 16.2
real bins 1 2 8 12 23 30 37 47 22.7 20.0
sum true (kb)  sum false (kb)  overall prec.  description
5735.3  7.5  99.9  root+superkingdom
4604.0  1095.1  80.8  phylum+class+order
4734.4  3292.7  59.0  family+genus+species
overall prec.  72.8 (all but unassigned)  74.2 (all with unassigned)

(b) all reference scenario
rank unassigned superkingdom phylum class order family genus species avg/sum avg/sum
depth 0 1 2 3 4 5 6 7 5.6 5.6
true (kb) 2.03 9.44 21.2 34.76 16.19 28.28 602.48 16325.9 17038.25 17040.28
false (kb) 0 0 0 0 0 0 0 0 0 0
unknown (kb) 0 0 0 0 0 0 0 0 0 0
macro precision α=0.99 100.0 100.0 100.0 100.0 100.0 100.0 100.0 100.0 100.0 100.0
stdev 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
pred. bins 1 1 1 2 3 3 4 4 2.6 2.4
macro recall 100.0 50.0 49.3 49.2 33.9 19.7 18.7 12.5 33.3 41.7
stdev 0.0 50.0 49.3 49.2 46.5 39.5 38.7 32.6 43.7 38.2
real bins 1 2 8 12 23 30 37 47 22.7 20.0
sum true (kb)  sum false (kb)  overall prec.  description
20.91  0  100.0  root+superkingdom
72.15  0  100.0  phylum+class+order
16956.66  0  100.0  family+genus+species
overall prec.  100.0 (all but unassigned)  100.0 (all with unassigned)

(c) new species scenario
rank unassigned superkingdom phylum class order family genus species avg/sum avg/sum
depth 0 1 2 3 4 5 6 7 5.0 4.9
true (kb) 234.56 273.59 1162.26 683.93 576.89 4640.58 5077.13 0 12414.38 12648.94
false (kb) 0 0 2.62 59.82 63.11 256.49 1966.57 2042.71 4391.32 4391.32
unknown (kb) 0 0 0 0 0 0 0 0 0 0
macro precision α=0.99 100.0 100.0 100.0 66.8 36.1 21.0 11.0 0.0 47.8 54.4
stdev 0.0 0.0 0.0 44.7 40.3 38.4 28.8 0.0 21.7 19.0
pred. bins 1 1 1 3 10 18 24 32 12.7 11.3
macro recall 100.0 48.4 55.4 45.6 36.2 21.3 6.1 0.0 30.4 39.1
stdev 0.0 48.4 44.1 37.9 39.5 36.6 18.7 0.0 32.2 28.2
real bins 1 2 8 12 23 30 37 47 22.7 20.0
sum true (kb)  sum false (kb)  overall prec.  description
781.74  0  100.0  root+superkingdom
2423.08  125.55  95.1  phylum+class+order
9717.71  4265.77  69.5  family+genus+species
overall prec.  73.9 (all but unassigned)  74.2 (all with unassigned)

(d) new genus scenario
rank unassigned superkingdom phylum class order family genus species avg/sum avg/sum
depth 0 1 2 3 4 5 6 7 4.3 4.2
true (kb) 358.62 526.48 1889.35 1360.44 1314.65 6466.24 0 0 11557.16 11915.78
false (kb) 0 0 2.62 89.37 128.45 303.88 2126.87 2473.3 5124.49 5124.49
unknown (kb) 0 0 0 0 0 0 0 0 0 0
macro precision α=0.99 100.0 100.0 100.0 65.6 31.8 13.9 0.0 0.0 44.5 51.4
stdev 0.0 0.0 0.0 45.6 40.5 32.2 0.0 0.0 16.9 14.8
pred. bins 1 1 1 3 11 17 39 45 16.7 14.8
macro recall 100.0 47.7 54.3 40.3 26.2 11.2 0.0 0.0 25.7 34.9
stdev 0.0 47.7 43.3 35.8 31.9 28.1 0.0 0.0 26.7 23.3
real bins 1 2 8 12 23 30 37 47 22.7 20.0
sum true (kb)  sum false (kb)  overall prec.  description
1411.58  0  100.0  root+superkingdom
4564.44  220.44  95.4  phylum+class+order
6466.24  4904.05  56.9  family+genus+species
overall prec.  69.3 (all but unassigned)  69.9 (all with unassigned)
Supplementary Figure S17 - MEGAN binning for FAMeS SimMC (continued)
[Plots not reproduced. Each panel, titled "MEGAN binning for FAMeS SimMC", shows assigned sequences in kb (true, false) and % macro-precision (α=0.99) and macro-recall against taxonomic rank.]

(e) new family scenario
rank unassigned superkingdom phylum class order family genus species avg/sum avg/sum
depth 0 1 2 3 4 5 6 7 2.9 2.8
true (kb) 663.34 1775.34 4398.98 4868.55 1480.8 0 0 0 12523.67 13187.01
false (kb) 0 0 13.81 130.75 325.61 1031.59 839.84 1511.68 3853.28 3853.28
unknown (kb) 0 0 0 0 0 0 0 0 0 0
macro precision α=0.99 100.0 100.0 52.5 38.0 9.8 0.0 0.0 0.0 28.6 37.5
stdev 0.0 0.0 47.5 42.5 23.2 0.0 0.0 0.0 16.2 14.1
pred. bins 1 1 2 5 18 28 47 47 21.1 18.6
macro recall 100.0 45.8 25.8 23.4 12.7 0.0 0.0 0.0 15.4 26.0
stdev 0.0 45.8 33.8 24.1 21.0 0.0 0.0 0.0 17.8 15.6
real bins 1 2 8 12 23 30 37 47 22.7 20.0
sum true (kb)  sum false (kb)  overall prec.  description
4214.02  0  100.0  root+superkingdom
10748.33  470.17  95.8  phylum+class+order
0  3383.11  0.0  family+genus+species
overall prec.  76.5 (all but unassigned)  77.4 (all with unassigned)

(f) new order scenario
rank unassigned superkingdom phylum class order family genus species avg/sum avg/sum
depth 0 1 2 3 4 5 6 7 2.7 2.5
true (kb) 767.74 2271.96 5432.33 4333.64 0 0 0 0 12037.93 12805.67
false (kb) 0 0 39.29 193 561.43 1006.4 817.67 1616.82 4234.61 4234.61
unknown (kb) 0 0 0 0 0 0 0 0 0 0
macro precision α=0.99 100.0 100.0 38.6 25.7 0.0 0.0 0.0 0.0 23.5 33.0
stdev 0.0 0.0 43.7 36.9 0.0 0.0 0.0 0.0 11.5 10.1
pred. bins 1 1 3 7 21 30 43 39 20.6 18.1
macro recall 100.0 45.1 41.6 11.5 0.0 0.0 0.0 0.0 14.0 24.8
stdev 0.0 45.1 40.0 13.9 0.0 0.0 0.0 0.0 14.1 12.4
real bins 1 2 8 12 23 30 37 47 22.7 20.0
sum true (kb)  sum false (kb)  overall prec.  description
5311.66  0  100.0  root+superkingdom
9765.97  793.72  92.5  phylum+class+order
0  3440.89  0.0  family+genus+species
overall prec.  74.0 (all but unassigned)  75.1 (all with unassigned)

(g) new class scenario
rank unassigned superkingdom phylum class order family genus species avg/sum avg/sum
depth 0 1 2 3 4 5 6 7 2.5 2.3
true (kb) 1274.66 4959.38 4654.01 0 0 0 0 0 9613.39 10888.05
false (kb) 0 0 87.88 1661.84 1445.26 960.19 1106.84 890.22 6152.23 6152.23
unknown (kb) 0 0 0 0 0 0 0 0 0 0
macro precision α=0.99 100.0 100.0 15.0 0.0 0.0 0.0 0.0 0.0 16.4 26.9
stdev 0.0 0.0 34.7 0.0 0.0 0.0 0.0 0.0 5.0 4.3
pred. bins 1 1 7 12 31 40 57 46 27.7 24.4
macro recall 100.0 42.4 21.4 0.0 0.0 0.0 0.0 0.0 9.1 20.5
stdev 0.0 42.4 34.5 0.0 0.0 0.0 0.0 0.0 11.0 9.6
real bins 1 2 8 12 23 30 37 47 22.7 20.0
sum true (kb)  sum false (kb)  overall prec.  description
11193.42  0  100.0  root+superkingdom
4654.01  3194.98  59.3  phylum+class+order
0  2957.25  0.0  family+genus+species
overall prec.  61.0 (all but unassigned)  63.9 (all with unassigned)

(h) new phylum scenario
rank unassigned superkingdom phylum class order family genus species avg/sum avg/sum
depth 0 1 2 3 4 5 6 7 2.3 1.8
true (kb) 2844.15 7184.81 0 0 0 0 0 0 7184.81 10028.96
false (kb) 0 52.43 273.68 588.78 1998.56 762.4 860.34 2475.13 7011.32 7011.32
unknown (kb) 0 0 0 0 0 0 0 0 0 0
macro precision α=0.99 100.0 100.0 0.0 0.0 0.0 0.0 0.0 0.0 14.3 25.0
stdev 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
pred. bins 1 1 14 26 40 61 72 69 40.4 35.5
macro recall 100.0 35.3 0.0 0.0 0.0 0.0 0.0 0.0 5.0 16.9
stdev 0.0 35.3 0.0 0.0 0.0 0.0 0.0 0.0 5.0 4.4
real bins 1 2 8 12 23 30 37 47 22.7 20.0
sum true (kb)  sum false (kb)  overall prec.  description
17213.77  52.43  99.7  root+superkingdom
0  2861.02  0.0  phylum+class+order
0  4097.87  0.0  family+genus+species
overall prec.  50.6 (all but unassigned)  58.9 (all with unassigned)
Supplementary Figure S18 - Taxator-tk binning for FAMeS SimMC
[Plots not reproduced. Each panel, titled "taxator-tk binning for FAMeS SimMC", shows assigned sequences in kb (true, false) and % macro-precision (α=0.99) and macro-recall against taxonomic rank.]

(a) summary scenario
rank unassigned superkingdom phylum class order family genus species avg/sum avg/sum
depth 0 1 2 3 4 5 6 7 2.6 2.3
true (kb) 2083.8 4704.3 3460.1 1860.6 560.8 1573.3 1012.7 978.0 14149.8 16233.6
false (kb) 0.0 0.6 26.9 182.2 257.2 89.4 196.9 53.3 806.6 806.6
unknown (kb) 0 0 0 0 0 0 0 0 0 0
macro precision α=0.99 100.0 100.0 52.0 49.0 40.7 22.8 37.5 39.2 48.7 55.2
stdev 0.0 0.0 36.2 47.6 41.6 38.3 45.7 48.4 36.8 32.2
pred. bins 1 1 4 4 15 19 19 54 16.6 14.6
macro recall 100.0 62.5 38.1 24.0 18.6 12.8 8.0 5.0 24.1 33.6
stdev 0.0 19.6 13.3 14.2 12.0 10.8 7.9 6.3 12.0 10.5
real bins 1 2 8 12 23 30 37 47 22.7 20.0
sum true (kb)  sum false (kb)  overall prec.  description
11492.5  0.6  100.0  root+superkingdom
5881.5  466.4  92.7  phylum+class+order
3564.0  339.7  91.3  family+genus+species
overall prec.  94.6 (all but unassigned)  95.3 (all with unassigned)

(b) all reference scenario
rank unassigned superkingdom phylum class order family genus species avg/sum avg/sum
depth 0 1 2 3 4 5 6 7 4.0 3.9
true (kb) 251.07 1303.39 1673.66 1129.62 647.31 1728.38 3460.93 6845.91 16789.2 17040.27
false (kb) 0 0 0 0 0 0 0 0 0 0
unknown (kb) 0 0 0 0 0 0 0 0 0 0
macro precision α=0.99 100.0 100.0 100.0 100.0 100.0 100.0 100.0 100.0 100.0 100.0
stdev 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
pred. bins 1 1 1 2 4 3 5 5 3.0 2.8
macro recall 100.0 98.8 85.8 73.6 68.7 59.0 47.1 34.9 66.8 71.0
stdev 0.0 1.2 15.8 22.1 29.4 42.0 43.6 44.2 28.3 24.8
real bins 1 2 8 12 23 30 37 47 22.7 20.0
sum true (kb)  sum false (kb)  overall prec.  description
2857.85  0  100.0  root+superkingdom
3450.59  0  100.0  phylum+class+order
12035.22  0  100.0  family+genus+species
overall prec.  100.0 (all but unassigned)  100.0 (all with unassigned)

(c) new species scenario
rank unassigned superkingdom phylum class order family genus species avg/sum avg/sum
depth 0 1 2 3 4 5 6 7 4.0 3.6
true (kb) 1558.82 1613.56 2761.05 1806.58 1024.51 3915.66 3628.14 0 14749.5 16308.32
false (kb) 0 0 2.62 27.87 19.75 33.96 630.72 17.05 731.97 731.97
unknown (kb) 0 0 0 0 0 0 0 0 0 0
macro precision α=0.99 100.0 100.0 100.0 69.1 75.1 81.0 63.1 0.0 69.7 73.5
stdev 0.0 0.0 0.0 42.1 35.6 36.6 44.8 0.0 22.7 19.9
pred. bins 1 1 1 3 7 6 3 11 4.6 4.1
macro recall 100.0 95.8 69.6 39.9 32.4 22.0 8.8 0.0 38.4 46.1
stdev 0.0 4.2 30.2 35.3 34.2 34.7 24.1 0.0 23.2 20.3
real bins 1 2 8 12 23 30 37 47 22.7 20.0
sum true (kb)  sum false (kb)  overall prec.  description
4785.94  0  100.0  root+superkingdom
5592.14  50.24  99.1  phylum+class+order
7543.8  681.73  91.7  family+genus+species
overall prec.  95.3 (all but unassigned)  95.7 (all with unassigned)

(d) new genus scenario
rank unassigned superkingdom phylum class order family genus species avg/sum avg/sum
depth 0 1 2 3 4 5 6 7 3.4 3.1
true (kb) 1745.53 1976.55 3398.87 2312.46 1189.02 5368.74 0 0 14245.64 15991.17
false (kb) 0 0 2.62 32.37 40.59 54.74 636.48 282.3 1049.1 1049.1
unknown (kb) 0 0 0 0 0 0 0 0 0 0
macro precision α=0.99 100.0 100.0 100.0 66.9 55.5 50.0 0.0 0.0 53.2 59.0
stdev 0.0 0.0 0.0 44.9 42.8 46.3 0.0 0.0 19.2 16.8
pred. bins 1 1 1 3 7 7 15 9 6.1 5.5
macro recall 100.0 94.6 63.2 30.2 22.2 8.3 0.0 0.0 31.2 39.8
stdev 0.0 5.4 32.7 32.2 32.0 22.3 0.0 0.0 17.8 15.6
real bins 1 2 8 12 23 30 37 47 22.7 20.0
sum true (kb)  sum false (kb)  overall prec.  description
5698.63  0  100.0  root+superkingdom
6900.35  75.58  98.9  phylum+class+order
5368.74  973.52  84.7  family+genus+species
overall prec.  93.1 (all but unassigned)  93.8 (all with unassigned)
Supplementary Figure S18 - Taxator-tk binning for FAMeS SimMC (continued)
[Plots not reproduced. Each panel, titled "taxator-tk binning for FAMeS SimMC", shows assigned sequences in kb (true, false) and % macro-precision (α=0.99) and macro-recall against taxonomic rank.]

(e) new family scenario
rank unassigned superkingdom phylum class order family genus species avg/sum avg/sum
depth 0 1 2 3 4 5 6 7 2.2 2.0
true (kb) 1444.21 4049.76 5463.82 4607.11 1065.02 0 0 0 15185.71 16629.92
false (kb) 0 0 11.04 80.28 96.56 179.84 32.8 9.85 410.37 410.37
unknown (kb) 0 0 0 0 0 0 0 0 0 0
macro precision α=0.99 100.0 100.0 100.0 61.6 20.6 0.0 0.0 0.0 40.3 47.8
stdev 0.0 0.0 0.0 43.9 37.4 0.0 0.0 0.0 11.6 10.2
pred. bins 1 1 1 3 18 21 14 7 9.3 8.3
macro recall 100.0 42.8 31.1 21.0 6.8 0.0 0.0 0.0 14.5 25.2
stdev 0.0 42.8 36.5 31.0 17.0 0.0 0.0 0.0 18.2 15.9
real bins 1 2 8 12 23 30 37 47 22.7 20.0
sum true (kb)  sum false (kb)  overall prec.  description
9543.73  0  100.0  root+superkingdom
11135.95  187.88  98.3  phylum+class+order
0  222.49  0.0  family+genus+species
overall prec.  97.4 (all but unassigned)  97.6 (all with unassigned)

(f) new order scenario
rank unassigned superkingdom phylum class order family genus species avg/sum avg/sum
depth 0 1 2 3 4 5 6 7 2.0 1.8
true (kb) 1665.24 5067.52 6525.5 3168.64 0 0 0 0 14761.66 16426.9
false (kb) 0 0 14.12 86.66 288.11 169.88 40.01 14.6 613.38 613.38
unknown (kb) 0 0 0 0 0 0 0 0 0 0
macro precision α=0.99 100.0 100.0 100.0 45.2 0.0 0.0 0.0 0.0 35.0 43.1
stdev 0.0 0.0 0.0 45.6 0.0 0.0 0.0 0.0 6.5 5.7
pred. bins 1 1 1 4 19 17 14 9 9.3 8.3
macro recall 100.0 41.4 13.5 3.3 0.0 0.0 0.0 0.0 8.3 19.8
stdev 0.0 41.4 20.3 6.6 0.0 0.0 0.0 0.0 9.8 8.5
real bins 1 2 8 12 23 30 37 47 22.7 20.0
sum true (kb)  sum false (kb)  overall prec.  description
11800.28  0  100.0  root+superkingdom
9694.14  388.89  96.1  phylum+class+order
0  224.49  0.0  family+genus+species
overall prec.  96.0 (all but unassigned)  96.4 (all with unassigned)

(g) new class scenario
rank unassigned superkingdom phylum class order family genus species avg/sum avg/sum
depth 0 1 2 3 4 5 6 7 1.6 1.3
true (kb) 2853.36 8356.71 4397.58 0 0 0 0 0 12754.29 15607.65
false (kb) 0 0 25.84 659.76 597.06 108.29 23.82 17.85 1432.62 1432.62
unknown (kb) 0 0 0 0 0 0 0 0 0 0
macro precision α=0.99 100.0 100.0 33.3 0.0 0.0 0.0 0.0 0.0 19.0 29.2
stdev 0.0 0.0 47.1 0.0 0.0 0.0 0.0 0.0 6.7 5.9
pred. bins 1 1 3 11 18 21 14 9 11.0 9.8
macro recall 100.0 36.9 3.2 0.0 0.0 0.0 0.0 0.0 5.7 17.5
stdev 0.0 36.9 8.5 0.0 0.0 0.0 0.0 0.0 6.5 5.7
real bins 1 2 8 12 23 30 37 47 22.7 20.0
sum true (kb)  sum false (kb)  overall prec.  description
19566.78  0  100.0  root+superkingdom
4397.58  1282.66  77.4  phylum+class+order
0  149.96  0.0  family+genus+species
overall prec.  89.9 (all but unassigned)  91.6 (all with unassigned)

(h) new phylum scenario
rank unassigned superkingdom phylum class order family genus species avg/sum avg/sum
depth 0 1 2 3 4 5 6 7 1.3 0.9
true (kb) 5068.27 10562.94 0 0 0 0 0 0 10562.94 15631.21
false (kb) 0 4.05 132.26 388.44 758.65 79.35 14.55 31.77 1409.07 1409.07
unknown (kb) 0 0 0 0 0 0 0 0 0 0
macro precision α=0.99 100.0 100.0 0.0 0.0 0.0 0.0 0.0 0.0 14.3 25.0
stdev 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
pred. bins 1 1 10 18 19 24 17 14 14.7 13.0
macro recall 100.0 27.3 0.0 0.0 0.0 0.0 0.0 0.0 3.9 15.9
stdev 0.0 27.3 0.0 0.0 0.0 0.0 0.0 0.0 3.9 3.4
real bins 1 2 8 12 23 30 37 47 22.7 20.0
sum true (kb)  sum false (kb)  overall prec.  description
26194.15  4.05  100.0  root+superkingdom
0  1279.35  0.0  phylum+class+order
0  125.67  0.0  family+genus+species
overall prec.  88.2 (all but unassigned)  91.7 (all with unassigned)
Supplementary Figure S19 - MEGAN binning for FAMeS SimHC
[Plots not reproduced. Each panel, titled "MEGAN binning for FAMeS SimHC", shows assigned sequences in bp (true, false) and % macro-precision (α=0.99) and macro-recall against taxonomic rank.]

(a) summary scenario
rank unassigned superkingdom phylum class order family genus species avg/sum avg/sum
depth 0 1 2 3 4 5 6 7 3.1 2.7
true (bp) 135097.7 184917.1 126186.3 105637.4 65414.1 47775.7 70368.1 110011.3 710310.2 845407.9
false (bp) 0.0 1240.4 17996.0 52554.0 53941.1 34408.6 42132.0 58509.7 260781.8 260781.8
unknown (bp) 0.0 0.0 0.0 2704.9 0.0 0.0 382.3 356.5 3443.6 3443.6
macro precision α=0.99 100.0 99.5 54.3 61.0 60.5 68.1 65.3 69.9 68.4 72.3
stdev 0.0 0.0 45.2 36.9 41.5 38.6 43.0 45.6 35.8 31.3
pred. bins 1 1 10 12 27 36 47 47 25.7 22.6
macro recall 100.0 65.7 39.8 32.6 17.4 13.1 8.4 4.7 25.9 35.2
stdev 0.0 21.3 24.2 19.7 17.2 13.3 9.4 6.5 15.9 14.0
real bins 1 2 8 12 36 52 72 96 39.7 34.9
sum true (bp)  sum false (bp)  overall prec.  description
504932.0  1240.4  99.8  root+superkingdom
297237.9  124491.1  70.5  phylum+class+order
228155.2  135050.2  62.8  family+genus+species
overall prec.  73.1 (all but unassigned)  76.4 (all with unassigned)

(b) all reference scenario
rank unassigned superkingdom phylum class order family genus species avg/sum avg/sum
depth 0 1 2 3 4 5 6 7 4.8 4.8
true (bp) 0 14830 27071 58344 38434 58139 223807 660068 1080693 1080693
false (bp) 0 0 0 0 0 0 0 0 0 0
unknown (bp) 0 0 0 0 0 0 2676 2139 4815 4815
macro precision α=0.99 100.0 100.0 99.6 99.6 99.8 99.4 99.5 99.5 99.6 99.7
stdev 0.0 0.0 1.0 0.8 0.7 3.1 2.8 2.9 1.6 1.4
pred. bins 1 2 6 9 20 29 34 33 19.0 16.8
macro recall 100.0 100.0 74.9 74.0 55.0 54.8 46.7 32.8 62.6 67.3
stdev 0.0 0.0 43.2 42.8 49.3 49.1 49.5 45.8 39.9 35.0
real bins 1 2 8 12 36 52 72 96 39.7 34.9
sum true (bp)  sum false (bp)  overall prec.  description
29660  0  100.0  root+superkingdom
123849  0  100.0  phylum+class+order
942014  0  100.0  family+genus+species
overall prec.  100.0 (all but unassigned)  100.0 (all with unassigned)

(c) new species scenario
rank unassigned superkingdom phylum class order family genus species avg/sum avg/sum
depth 0 1 2 3 4 5 6 7 4.4 4.1
true (bp) 53833 71113 85694 111601 113724 140308 268770 0 791210 845043
false (bp) 0 0 0 14697 10562 25633 68288 121285 240465 240465
unknown (bp) 0 0 0 0 0 0 0 0 0 0
macro precision α=0.99 100.0 99.9 99.7 96.4 86.9 83.4 56.8 0.0 74.7 77.9
stdev 0.0 0.1 0.5 4.6 29.7 27.0 44.8 0.0 15.2 13.3
pred. bins 1 2 6 9 18 20 17 8 11.4 10.1
macro recall 100.0 86.2 65.7 57.5 33.0 24.5 11.9 0.0 39.8 47.3
stdev 0.0 8.4 38.2 34.8 38.4 35.6 29.1 0.0 26.4 23.1
real bins 1 2 8 12 36 52 72 96 39.7 34.9
sum true (bp)  sum false (bp)  overall prec.  description
196059  0  100.0  root+superkingdom
311019  25259  92.5  phylum+class+order
409078  215206  65.5  family+genus+species
overall prec.  76.7 (all but unassigned)  77.8 (all with unassigned)

(d) new genus scenario
rank unassigned superkingdom phylum class order family genus species avg/sum avg/sum
depth 0 1 2 3 4 5 6 7 3.6 3.4
true (bp) 60008 103063 142075 171151 168553 135983 0 0 720825 780833
false (bp) 0 0 0 27602 29605 53126 123176 71166 304675 304675
unknown (bp) 0 0 0 0 0 0 0 0 0 0
macro precision α=0.99 100.0 99.9 98.3 92.0 73.6 55.4 0.0 0.0 59.9 64.9
stdev 0.0 0.1 2.3 6.7 34.2 39.5 0.0 0.0 11.8 10.4
pred. bins 1 2 6 9 18 16 12 5 9.7 8.6
macro recall 100.0 85.7 62.6 49.7 25.2 12.0 0.0 0.0 33.6 41.9
stdev 0.0 8.0 36.7 31.2 32.4 25.2 0.0 0.0 19.1 16.7
real bins 1 2 8 12 36 52 72 96 39.7 34.9
sum true (bp)  sum false (bp)  overall prec.  description
266134  0  100.0  root+superkingdom
481779  57207  89.4  phylum+class+order
135983  247468  35.5  family+genus+species
overall prec.  70.3 (all but unassigned)  71.9 (all with unassigned)
Supplementary Figure S19 - MEGAN binning for FAMeS SimHC true (bp)
false (bp)
0 1 2 3 4 5 6 7 2.9 2.4
180596 158948 202392 200853 137188 0 0 0 699381 879977
0 1776 3264 52078 48101 57702 31996 10614 205531 205531
unknown (bp) 0 0 0 0 0 0 0 0 0 0
macro precision α=0.99 100.0 99.7 95.4 82.6 56.7 0.0 0.0 0.0 47.8 54.3
stdev
pred. bins
macro recall
stdev
real bins
0.0 0.3 4.8 9.8 33.0 0.0 0.0 0.0 6.8 6.0
1 2 5 7 11 8 3 1 5.3 4.8
100.0 69.3 35.9 30.3 8.4 0.0 0.0 0.0 20.5 30.5
0.0 13.7 33.9 27.9 17.2 0.0 0.0 0.0 13.2 11.6
1 2 8 12 36 52 72 96 39.7 34.9
sum true (bp)   sum false (bp)   overall prec   description
498492          1776             99.6           root+superkingdom
540433          103443           83.9           phylum+class+order
0               100312           0.0            family+genus+species
overall prec, all but unassigned: 77.3; all with unassigned: 81.1
rank unassigned superkingdom phylum class order family genus species avg/sum avg/sum
depth
true (bp)
false (bp)
0 1 2 3 4 5 6 7 2.5 2.3
115421 221464 250454 197513 0 0 0 0 669431 784852
0 1776 8635 58615 102179 40254 35656 52151 299266 299266
[Plot residue: axes are taxonomic rank (unassigned to species), assigned sequences in bp (up to 600,000) and % macro-precision and macro-recall; legend: false (bp), true (bp), macro precision α=0.99, macro recall.]
rank
depth
true (bp)
false (bp)
unknown (bp)
0 1 2 3 4 5 6 7 2.2 1.6
271642 343666 175618 0 0 0 0
0 1776 16711 146037 72649 34445 22964
0 0 0 0 0 0 0
519284 790926
294582 294582
0 0
25.5 36.1
(g) new class scenario stdev
pred. bins
macro recall
stdev
real bins
0.0 0.0 38.2 0.0 0.0 0.0 0.0
1 1 3 7 9 5 2
6.4 5.5
4.5 4.0
100.0 37.7 10.0 0.0 0.0 0.0 0.0 0.0 6.8 18.5
0.0 37.7 18.1 0.0 0.0 0.0 0.0 0.0 8.0 7.0
1 2 8 12 36 52 72 96 39.7 34.9
sum true (bp)   sum false (bp)   overall prec   description
958974          1776             99.8           root+superkingdom
175618          235397           42.7           phylum+class+order
0               57409            0.0            family+genus+species
overall prec, all but unassigned: 63.8; all with unassigned: 72.9
100.0 44.3 29.2 16.7 0.0 0.0 0.0 0.0 12.9 23.8
0.0 44.3 34.1 18.1 0.0 0.0 0.0 0.0 13.8 12.1
1 2 8 12 36 52 72 96 39.7 34.9
sum true (bp)   sum false (bp)   overall prec   description
558349          1776             99.7           root+superkingdom
447967          169429           72.6           phylum+class+order
0               128061           0.0            family+genus+species
overall prec, all but unassigned: 69.1; all with unassigned: 72.4
[Plot residue: axes are taxonomic rank (unassigned to species), assigned sequences in bp and a percentage scale; legend: false (bp), true (bp), macro precision α=0.99, macro recall.]
Supplementary Figure S19 - MEGAN binning for FAMeS SimHC rank unassigned superkingdom phylum class order family genus species avg/sum avg/sum
[Plot residue: axis ticks and labels for assigned sequences in bp and % macro-precision and macro-recall; legend: false (bp), true (bp), macro precision α=0.99, macro recall.]
1 1 4 7 12 6 5 2 5.3 4.8
sum false (bp)
depth
true (bp)
false (bp)
unknown (bp)
0 1 2 3 4 5 6 7 2.1 1.5
264184 381336 0 0 0 0 0 0 381336 645520
0 3355 97362 68849 114492 29700 12844 95842 422444 422444
0 0 0 17544 0 0 0 0 17544 17544
macro precision α=0.99 100.0 98.8 0.0 0.0 0.0 0.0 0.0 0.0 14.1 24.9
(h) new phylum scenario stdev
pred. bins
macro recall
stdev
real bins
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 1 12 11 10 6 4 5 7.0 6.3
100.0 36.7 0.0 0.0 0.0 0.0 0.0 0.0 5.2 17.1
0.0 36.7 0.0 0.0 0.0 0.0 0.0 0.0 5.2 4.6
1 2 8 12 36 52 72 96 39.7 34.9
sum true (bp)   sum false (bp)   overall prec   description
1026856         3355             99.7           root+superkingdom
0               280703           0.0            phylum+class+order
0               138386           0.0            family+genus+species
overall prec, all but unassigned: 47.4; all with unassigned: 60.4
MEGAN binning for FAMeS SimHC 600,000
80
class
0.0 0.0 18.2 16.6 0.0 0.0 0.0 0.0 5.0 4.3
sum true (bp)
taxonomic rank
90
phylum
real bins
600,000
MEGAN binning for FAMeS SimHC
unassigned superkingdom
stdev
80
0
100
0
macro recall
90
0
Supplementary Figure S19 - MEGAN binning for FAMeS SimHC
unassigned superkingdom phylum class order family genus species avg/sum avg/sum
pred. bins
100
taxonomic rank
macro precision α=0.99 100.0 99.3 53.5 0.0 0.0 0.0 0.0
stdev
MEGAN binning for FAMeS SimHC
100
assigned sequences in bp
% macro-precision and macro-recall
MEGAN binning for FAMeS SimHC
unknown (bp) 0 0 0 1390 0 0 0 0 1390 1390
macro precision α=0.99 100.0 99.4 84.2 62.1 0.0 0.0 0.0 0.0 35.1 43.2
(f) new order scenario
assigned sequences in bp
unassigned superkingdom phylum class order family genus species avg/sum avg/sum
depth
[Plot residue, Supplementary Figure S19 - MEGAN binning for FAMeS SimHC: axes are taxonomic rank (unassigned to species), assigned sequences in bp (0 to 600,000) and a percentage scale (0 to 100).]
rank
(e) new family scenario
false (bp) true (bp) macro precision α=0.99 macro recall
Supplementary Figure S20 - Taxator-tk binning for FAMeS SimHC
true (bp)
false (bp)
0 1 2 3 4 5 6 7 2.3 1.8
213863.4 389604.7 144608.1 74893.3 39907.3 31822.9 59831.9 61405.0 802073.1 1015936.6
0.0 1240.4 5934.9 12043.4 14116.0 11110.7 13687.3 9917.1 68049.9 68049.9
unknown (bp) 0.0 0.0 0.0 757.0 0.0 0.0 382.3 382.3 1521.6 1521.6
macro precision α=0.99 100.0 99.9 96.5 92.7 65.6 68.9 75.1 76.6 82.2 84.4
stdev
pred. bins
macro recall
stdev
real bins
0.0 0.1 5.1 8.4 44.8 43.0 40.3 41.5 26.2 22.9
1 2 7 11 47 58 68 66 37.0 32.5
100.0 64.1 34.9 24.4 15.3 11.7 8.2 4.1 23.2 32.8
0.0 14.9 7.9 10.3 10.0 9.7 7.5 4.8 9.3 8.1
1 2 8 12 36 52 72 96 39.7 34.9
sum true (bp)   sum false (bp)   overall prec.   description
993072.9        1240.4           99.9            root+superkingdom
259408.7        32094.3          89.0            phylum+class+order
153059.7        34715.1          81.5            family+genus+species
overall prec., all but unassigned: 92.2; all with unassigned: 93.7
rank
depth 0 1 2 3 4 5 6 7 3.6 3.3
unassigned superkingdom phylum class order family genus species avg/sum avg/sum
true (bp) 47885 163775 70870 65908 48339 72448 179113 429835 1030288 1078173
[Plot residue: axes are taxonomic rank (unassigned to species), assigned sequences in bp (up to 600,000) and % macro-precision and macro-recall; legend: false (bp), true (bp), macro precision α=0.99, macro recall.]
0 0 0 0 0 2520 2520 2520
rank
depth 0 1 2 3 4 5 6 7 3.5 3.2
true (bp) 92565 257961 152494 94075 96579 80651 239710 0 921470 1014035
false (bp)
unknown (bp)
0 0 6930 10331 34526 19686 71473 71473
0 0 0 0 0 0 0 0 0 0
(c) new species scenario stdev
pred. bins
macro recall
stdev
real bins
0.0 0.0 0.0 0.0 28.8 30.4 46.1 0.0 15.0 13.2
1 2 7 10 26 30 31 9 16.4 14.5
100.0 84.1 62.6 48.6 30.0 23.7 13.9 0.0 37.6 45.4
0.0 6.4 26.6 28.5 30.1 31.8 26.1 0.0 21.3 18.7
1 2 8 12 36 52 72 96 39.7 34.9
sum true (bp)   sum false (bp)   overall prec.   description
608487          0                100.0           root+superkingdom
343148          6930             98.0            phylum+class+order
320361          64543            83.2            family+genus+species
overall prec., all but unassigned: 92.8; all with unassigned: 93.4
1 2 6 9 32 43 55 52 28.4 25.0
100.0 97.6 70.3 59.3 55.5 49.9 43.1 28.5 57.8 63.0
0.0 2.4 28.5 28.9 30.6 34.0 35.7 33.5 27.7 24.2
1 2 8 12 36 52 72 96 39.7 34.9
sum true (bp)   sum false (bp)   overall prec.   description
375435          0                100.0           root+superkingdom
185117          0                100.0           phylum+class+order
681396          2520             99.6            family+genus+species
overall prec., all but unassigned: 99.8; all with unassigned: 99.8
[Plot residue: axes are taxonomic rank (unassigned to species), assigned sequences in bp and a percentage scale; legend: false (bp), true (bp), macro precision α=0.99, macro recall.]
Supplementary Figure S20 - Taxator-tk binning for FAMeS SimHC rank
depth 0 1 2 3 4 5 6 7 2.4 2.2
unassigned superkingdom phylum class order family genus species avg/sum avg/sum
[Plot residue: axis ticks and labels for assigned sequences in bp and % macro-precision and macro-recall; legend: false (bp), true (bp), macro precision α=0.99, macro recall.]
0.0 0.0 1.2 1.1 0.4 0.4 0.3 0.3 0.5 0.5
sum true (bp)
true (bp) 121975 341016 232318 148392 88010 69661 0 0 879397 1001372
false (bp)
unknown (bp)
0 5857 11695 14936 39229 12419 84136 84136
0 0 0 0 0 0 0 0 0 0
macro precision α=0.99 100.0 100.0 100.0 98.6 77.2 58.7 0.0 0.0 62.1 66.8
(d) new genus scenario stdev
pred. bins
macro recall
stdev
real bins
0.0 0.0 0.0 2.6 38.5 46.8 0.0 0.0 12.6 11.0
1 2 7 11 23 20 14 7 12.0 10.6
100.0 82.7 57.5 37.0 17.8 8.1 0.0 0.0 29.0 37.9
0.0 4.9 14.0 20.9 24.3 19.0 0.0 0.0 11.9 10.4
1 2 8 12 36 52 72 96 39.7 34.9
sum true (bp)   sum false (bp)   overall prec.   description
804007          0                100.0           root+superkingdom
468720          17552            96.4            phylum+class+order
69661           66584            51.1            family+genus+species
overall prec., all but unassigned: 91.3; all with unassigned: 92.2
taxator-tk binning for FAMeS SimHC 600,000
80
class
real bins
taxonomic rank
90
phylum
stdev
600,000
taxator-tk binning for FAMeS SimHC
unassigned superkingdom
macro recall
80
0
100
0
pred. bins
90
0
Supplementary Figure S20 - Taxator-tk binning for FAMeS SimHC
unassigned superkingdom phylum class order family genus species avg/sum avg/sum
stdev
100
taxonomic rank
macro precision α=0.99 100.0 100.0 100.0 100.0 88.6 86.9 65.5 0.0 77.3 80.1
unknown (bp) 0 0 0 2139 0 0 2676 0 4815 4815
macro precision α=0.99 100.0 100.0 99.4 99.5 99.9 99.9 100.0 100.0 99.8 99.8
taxator-tk binning for FAMeS SimHC
taxator-tk binning for FAMeS SimHC 100
0
false (bp)
(b) all reference scenario
assigned sequences in bp
unassigned superkingdom phylum class order family genus species avg/sum avg/sum
depth
[Plot residue, Supplementary Figure S20 - Taxator-tk binning for FAMeS SimHC: axes are taxonomic rank (unassigned to species), assigned sequences in bp (0 to 600,000) and a percentage scale (0 to 100).]
rank
(a) summary scenario
false (bp) true (bp) macro precision α=0.99 macro recall
Supplementary Figure S20 - Taxator-tk binning for FAMeS SimHC
0 1 2 3 4 5 6 7 2.0 1.5
unassigned superkingdom phylum class order family genus species avg/sum avg/sum
true (bp) 243406 383364 215052 139146 46423 0 0 0 783985 1027391
false (bp)
1898 11207 8393 19955 7389 7499 56341 56341
unknown (bp) 0 0 0 0 0 0 0 0 0 0
macro precision α=0.99 100.0 99.6 98.5 92.4 57.8 0.0 0.0 0.0 49.8 56.0
stdev
pred. bins
macro recall
stdev
real bins
0.0 0.0 1.9 11.7 44.5 0.0 0.0 0.0 8.3 7.3
1 1 7 10 12 11 6 3 7.1 6.4
100.0 55.2 28.4 20.6 3.6 0.0 0.0 0.0 15.4 26.0
0.0 21.8 17.2 14.9 9.5 0.0 0.0 0.0 9.1 7.9
1 2 8 12 36 52 72 96 39.7 34.9
sum true (bp)   sum false (bp)   overall prec.   description
1010134         0                100.0           root+superkingdom
400621          21498            94.9            phylum+class+order
0               34843            0.0             family+genus+species
overall prec., all but unassigned: 93.3; all with unassigned: 94.8
rank
depth 0 1 2 3 4 5 6 7 1.7 1.4
unassigned superkingdom phylum class order family genus species avg/sum avg/sum
true (bp) 197736 480003 238652 76732 0 0 0 0 795387 993123
[Plot residue: axes are taxonomic rank (unassigned to species), assigned sequences in bp (up to 600,000) and % macro-precision and macro-recall; legend: false (bp), true (bp), macro precision α=0.99, macro recall.]
3399 11575 38737 16635 11103 9160 90609 90609
rank
depth 0 1 2 3 4 5 6 7 1.4 0.9
true (bp) 366106 538424 102871 0 0 0 0 0 641295 1007401
false (bp)
unknown (bp)
9200 39497 11532 7485 3564 5053 76331 76331
0 0 0 0 0 0 0 0 0 0
(g) new class scenario stdev
pred. bins
macro recall
stdev
real bins
0.0 0.0 42.8 0.0 0.0 0.0 0.0 0.0 6.1 5.3
1 1 4 9 8 6 2 2 4.6 4.1
100.0 38.5 3.8 0.0 0.0 0.0 0.0 0.0 6.0 17.8
0.0 27.4 7.3 0.0 0.0 0.0 0.0 0.0 5.0 4.3
1 2 8 12 36 52 72 96 39.7 34.9
sum true (bp)   sum false (bp)   overall prec.   description
1442954         0                100.0           root+superkingdom
102871          60229            63.1            phylum+class+order
0               16102            0.0             family+genus+species
overall prec., all but unassigned: 89.4; all with unassigned: 93.0
1 1 5 7 13 9 6 3 6.3 5.6
100.0 56.7 21.9 5.3 0.0 0.0 0.0 0.0 12.0 23.0
0.0 23.4 17.7 5.9 0.0 0.0 0.0 0.0 6.7 5.9
1 2 8 12 36 52 72 96 39.7 34.9
sum true (bp)   sum false (bp)   overall prec.   description
1157742         0                100.0           root+superkingdom
315384          53711            85.4            phylum+class+order
0               36898            0.0             family+genus+species
overall prec., all but unassigned: 89.8; all with unassigned: 91.6
[Plot residue: axes are taxonomic rank (unassigned to species), assigned sequences in bp and a percentage scale; legend: false (bp), true (bp), macro precision α=0.99, macro recall.]
Supplementary Figure S20 - Taxator-tk binning for FAMeS SimHC rank
depth 0 1 2 3 4 5 6 7 1.2 0.7
unassigned superkingdom phylum class order family genus species avg/sum avg/sum
[Plot residue: axis ticks and labels for assigned sequences in bp and % macro-precision and macro-recall; legend: false (bp), true (bp), macro precision α=0.99, macro recall.]
0.0 0.0 12.7 25.1 0.0 0.0 0.0 0.0 5.4 4.7
sum true (bp)
true (bp) 427371 562690 0 0 0 0 0 0 562690 990061
false (bp)
unknown (bp)
27047 16168 21525 8433 0 13083 86256 86256
0 0 0 3160 0 0 0 2676 5836 5836
macro precision α=0.99 100.0 99.1 0.0 0.0 0.0 0.0 0.0 0.0 14.2 24.9
(h) new phylum scenario stdev
pred. bins
macro recall
stdev
real bins
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 1 11 11 10 7 3 5 6.9 6.1
100.0 34.1 0.0 0.0 0.0 0.0 0.0 0.0 4.9 16.8
0.0 23.0 0.0 0.0 0.0 0.0 0.0 0.0 3.3 2.9
1 2 8 12 36 52 72 96 39.7 34.9
sum true (bp)   sum false (bp)   overall prec.   description
1552751         0                100.0           root+superkingdom
0               64740            0.0             phylum+class+order
0               21516            0.0             family+genus+species
overall prec., all but unassigned: 86.7; all with unassigned: 92.0
taxator-tk binning for FAMeS SimHC 600,000
80
class
real bins
taxonomic rank
90
phylum
stdev
600,000
taxator-tk binning for FAMeS SimHC
unassigned superkingdom
macro recall
80
0
100
0
pred. bins
90
0
Supplementary Figure S20 - Taxator-tk binning for FAMeS SimHC
unassigned superkingdom phylum class order family genus species avg/sum avg/sum
stdev
100
taxonomic rank
macro precision α=0.99 100.0 99.5 42.6 0.0 0.0 0.0 0.0 0.0 20.3 30.3
unknown (bp) 0 0 0 0 0 0 0 0 0 0
macro precision α=0.99 100.0 99.6 91.7 70.7 0.0 0.0 0.0 0.0 37.4 45.2
taxator-tk binning for FAMeS SimHC
100
assigned sequences in bp
% macro-precision and macro-recall
taxator-tk binning for FAMeS SimHC
false (bp)
(f) new order scenario
assigned sequences in bp
depth
[Plot residue, Supplementary Figure S20 - Taxator-tk binning for FAMeS SimHC: axes are taxonomic rank (unassigned to species), assigned sequences in bp (0 to 600,000) and a percentage scale (0 to 100).]
rank
(e) new family scenario
false (bp) true (bp) macro precision α=0.99 macro recall
Supplementary Figure S21 - Binning for FAMeS SimMC scenario (Nature Methods 2011)
true (kb)
false (kb)
0 1 2 3 4 5 6 7 2.1 2.1
199.2 4663.42 7936.17 2215.89 1729.56 191.38 19 1.73 16757.15 16956.35
0 2.03 2.11 25.89 28.97 13.42 11.53 0 83.95 83.95
unknown (kb) 0 0 0 0 0 0 0 0 0 0
macro precision α=0.95 100.0 100.0 100.0 69.5 68.1 60.1 50.0 100.0 78.2 81.0
stdev
pred. bins
macro recall
stdev
real bins
0.0 0.0 0.0 42.7 45.1 47.2 50.0 0.0 26.4 23.1
1 1 1 3 6 14 10 1 5.1 4.6
100.0 49.4 29.9 18.9 20.1 17.1 9.1 2.1 20.9 30.8
0.0 49.4 36.1 28.7 32.0 33.0 25.3 14.4 31.3 27.4
1 2 8 12 22 29 37 47 22.4 19.8
sum true (kb)   sum false (kb)   overall prec.   description
9526.04         2.03             100.0           root+superkingdom
11881.62        56.97            99.5            phylum+class+order
212.11          24.95            89.5            family+genus+species
overall prec., all but unassigned: 99.5; all with unassigned: 99.5
rank
unassigned superkingdom phylum class order family genus species avg/sum avg/sum
false (kb)
0 1 2 3 4 5 6 7 2.5 2.2
1461.8 3788.36 4989.57 3035.46 3150.39 347.7 17.35 12.23 15341.06 16802.86
0 0 0 58.8 34.78 56.72 31.58 55.55 237.43 237.43
[Plot residue: axes are taxonomic rank (unassigned to species), an assigned-sequence scale (up to 9,000) and precision and recall in %; legend: false (kb), true (kb), macro precision α=0.95, macro recall.]
stdev
real bins
100.0 95.7 59.9 27.7 22.5 19.2 10.9 6.4 34.6 42.8
0.0 4.3 33.3 30.1 32.4 35.2 28.9 24.4 26.9 23.6
1 2 8 12 22 29 37 47 22.4 19.8
sum true (kb)   sum false (kb)   overall prec.   description
9038.52         0                100.0           root+superkingdom
11175.42        93.58            99.2            phylum+class+order
377.28          143.85           72.4            family+genus+species
overall prec., all but unassigned: 98.5; all with unassigned: 98.6
[Plot residue: axis ticks (assigned-sequence scale up to 8,000; percentage scale) and the first taxonomic rank labels.]
depth
true (kb)
false (kb)
unknown (kb)
0 1 2 3 4 5 6 7 3.5 3.1
1014.03 1534.22 1896.2 1021.99 1725.65 935.47 18.97 457.46 7589.96 8603.99
0 0 1.83 108.21 118.82 266.12 1684.92 6256.4 8436.3 8436.3
0 0 0 0 0 0 0 0 0 0
macro precision α=0.99 100.0 100.0 100.0 65.6 29.6 13.6 9.2 6.1 46.3 53.0
[Plot residue: x-axis taxonomic rank (class to species); legend: false (kb), true (kb), macro precision α=0.99, macro recall.]
Supplementary Figure S21 - Binning for FAMeS SimMC scenario (Nature Methods 2011)
(c) MEGAN4 (nucleotide)
stdev
pred. bins
macro recall
stdev
real bins
0.0 0.0 0.0 45.6 40.6 32.0 25.9 21.2 23.6 20.7
1 1 1 3 11 17 36 38 15.3 13.5
100.0 47.0 34.5 29.4 30.1 15.3 3.8 0.2 22.9 32.5
0.0 47.0 38.5 31.7 35.2 30.8 14.5 1.0 28.4 24.8
1 2 8 12 22 29 37 47 22.4 19.8
sum true (kb)   sum false (kb)   overall prec.   description
4082.47         0                100.0           root+superkingdom
4643.84         228.86           95.3            phylum+class+order
1411.9          8207.44          14.7            family+genus+species
overall prec., all but unassigned: 47.4; all with unassigned: 50.5
Supplementary Figure S21 - Binning for FAMeS SimMC scenario (Nature Methods 2011) rank unassigned superkingdom phylum class order family genus species avg/sum avg/sum
MEGAN4 binning for FAMeS SimMC scenario (Nature Methods 2011) using nucleotide-level alignment
depth
true (kb)
false (kb)
unknown (kb)
0 1 2 3 4 5 6 7 3.7 3.3
1198.19 749.27 1920.91 1049.16 1771.8 961.81 2.33 457.46 6912.74 8110.93
0 0 19.56 158.44 166.53 367.01 1871.14 6346.69 8929.37 8929.37
0 0 0 0 0 0 0 0 0 0
macro precision α=0.99 100.0 100.0 50.6 49.0 23.1 13.5 7.5 7.4 35.9 43.9
(d) MEGAN5 (nucleotide)
stdev
pred. bins
macro recall
stdev
real bins
0.0 0.0 49.3 47.9 38.4 32.0 22.8 23.2 30.5 26.7
1 1 2 4 10 17 32 31 13.9 12.3
100.0 46.5 35.1 27.6 25.3 13.2 2.0 0.2 21.4 31.2
0.0 46.5 39.8 33.1 34.2 29.5 10.2 1.0 27.7 24.3
1 2 8 12 22 29 37 47 22.4 19.8
sum true (kb)   sum false (kb)   overall prec.   description
2696.73         0                100.0           root+superkingdom
4741.87         344.53           93.2            phylum+class+order
1421.6          8584.84          14.2            family+genus+species
overall prec., all but unassigned: 43.6; all with unassigned: 47.6
MEGAN5 binning for FAMeS SimMC scenario (Nature Methods 2011) using nucleotide-level alignment
[Plot residue: axes are taxonomic rank (unassigned to species), assigned sequences in kb (up to 9,000) and precision and recall in % (0 to 100); legend: false (kb), true (kb), macro precision α=0.99, macro recall.]
macro recall
1 1 1 3 7 18 18 13 8.7 7.8
9,000
taxonomic rank
0
pred. bins
0.0 0.0 0.0 46.5 44.7 49.0 45.7 44.0 32.8 28.7
80
assigned sequences in kb
7,000
70
unassigned superkingdom phylum class order family genus species avg/sum avg/sum
stdev
90
8,000
80
rank
unknown (kb) 0 0 0 0 0 0 0 0 0 0
macro precision α=0.99 100.0 100.0 100.0 65.8 59.1 54.6 32.1 29.1 63.0 67.6
100
100
precision and recall in %
true (kb)
taxator-tk binning for FAMeS SimMC scenario (Nature Methods 2011) using protein-level alignment
taxator-tk binning for FAMeS SimMC scenario (Nature Methods 2011) using nucleotide-level alignment
0
depth
(b) taxator-tk (amino acid)
assigned sequences in kb
unassigned superkingdom phylum class order family genus species avg/sum avg/sum
depth
[Plot residue, Supplementary Figure S21 - Binning for FAMeS SimMC scenario (Nature Methods 2011): axes are taxonomic rank (unassigned to species), assigned sequences in kb (0 to 7,000 shown) and a percentage scale.]
rank
(a) taxator-tk (nucleotide)
false (kb) true (kb) macro precision α=0.99 macro recall
Supplementary Figure S21 - Binning for FAMeS SimMC scenario (Nature Methods 2011) true (kb)
false (kb)
0 1 2 3 4 5 6 7 3.3 3.2
232.06 1647.09 3471.29 2273.77 2950.28 1710.01 18.97 385.56 12456.97 12689.03
0 0 14 182.52 227.02 284.1 1024.71 2618.9 4351.25 4351.25
unknown (kb) 0 0 0 0 0 0 0 0 0 0
macro precision α=0.99 100.0 100.0 51.0 49.1 31.1 15.6 12.3 11.1 38.6 46.3
stdev
pred. bins
macro recall
stdev
real bins
0.0 0.0 49.0 47.6 37.3 32.3 29.1 30.2 32.2 28.2
1 1 2 4 11 18 29 25 12.9 11.4
100.0 49.3 38.7 32.8 25.4 7.6 3.8 0.2 22.6 32.2
0.0 49.3 43.2 35.3 35.4 20.1 14.5 0.8 28.4 24.8
1 2 8 12 22 29 37 47 22.4 19.8
sum true (kb)   sum false (kb)   overall prec.   description
3526.24         0                100.0           root+superkingdom
8695.34         423.54           95.4            phylum+class+order
2114.54         3927.71          35.0            family+genus+species
overall prec., all but unassigned: 74.1; all with unassigned: 74.5
rank unassigned superkingdom phylum class order family genus species avg/sum avg/sum
depth
true (kb)
false (kb)
0 1 2 3 4 5 6 7 3.9 3.5
1788.99 168.48 1071.86 2853.36 5824.61 1266.67 364.14 89.11 11638.23 13427.22
0 0 49.59 446.35 769.52 1107.7 796.45 443.46 3613.07 3613.07
MEGAN5 binning for FAMeS SimMC scenario (Nature Methods 2011) using protein-level alignment
real bins
0.0 44.7 36.4 34.0 37.0 41.8 32.1 14.4 34.4 30.1
1 2 8 12 22 29 37 47 22.4 19.8
sum true (kb)   sum false (kb)   overall prec.   description
2125.95         0                100.0           root+superkingdom
9749.83         1265.46          88.5            phylum+class+order
1719.92         2347.61          42.3            family+genus+species
overall prec., all but unassigned: 76.3; all with unassigned: 78.8
[Plot residue: axes are taxonomic rank (unassigned to species), an assigned-sequence scale (up to 9,000) and precision and recall in %; legend: false (kb), true (kb), macro precision α=0.99, macro recall.]
depth
true (kb)
false (kb)
unknown (kb)
0 1 2 3 4 5 6 7 3.1 2.9
843.44 1483.22 3695.28 3671.01 5540.57 345.44 237.82 170.45 15143.79 15987.23
0 0 28.78 275.38 234.56 315.09 174.47 24.78 1053.06 1053.06
0 0 0 0 0 0 0 0 0 0
macro precision α=0.99 100.0 100.0 100.0 65.5 48.0 28.6 21.7 19.0 54.7 60.3
[Plot residue: x-axis taxonomic rank (class to species); legend: false (kb), true (kb), macro precision α=0.99, macro recall.]
Supplementary Figure S21 - Binning for FAMeS SimMC scenario (Nature Methods 2011)
(g) CARMA (amino acid)
stdev
pred. bins
macro recall
stdev
real bins
0.0 0.0 0.0 44.9 43.3 41.3 39.9 38.2 29.7 26.0
1 1 1 3 10 32 36 25 15.4 13.6
100.0 47.5 46.0 35.4 32.1 26.3 12.4 4.3 29.2 38.0
0.0 47.5 38.1 32.6 33.8 39.1 29.7 20.2 34.4 30.1
1 2 8 12 22 29 37 47 22.4 19.8
sum true (kb)   sum false (kb)   overall prec.   description
3809.88         0                100.0           root+superkingdom
12906.86        538.72           96.0            phylum+class+order
753.71          514.34           59.4            family+genus+species
overall prec., all but unassigned: 93.5; all with unassigned: 93.8
Supplementary Figure S21 - Binning for FAMeS SimMC scenario (Nature Methods 2011) rank
depth 0 1 2 3 4 5 6 7 5.0 5.0
unassigned superkingdom phylum class order family genus species avg/sum avg/sum
CARMA binning for FAMeS SimMC scenario (Nature Methods 2011) using protein-level alignment
true (kb)
false (kb)
unknown (kb)
0 832.36 517.19 1297.42 659.69 5715.02 7116.9
0 0 25.6 52.21 76.38 272.58 474.94
0 0 0 0 0 0 0
macro precision α=0.99 100.0 100.0 100.0 67.6 49.6 49.3 49.1
16138.58 16138.58
901.71 901.71
0 0
69.2 73.6
(h) PhyloPythiaS
stdev
pred. bins
macro recall
stdev
real bins
0.0 0.0 0.0 45.5 44.9 48.2 46.0
1 1 1 3 6 6 6
100.0 49.9 50.8 54.5 37.4 33.8 23.3
0.0 49.9 41.1 40.0 34.2 40.7 37.4
1 2 8 12 22 29 37
30.8 26.4
3.8 3.4
41.6 49.9
40.6 34.8
18.3 15.9
sum true (kb)   sum false (kb)   overall prec.   description
1664.72         0                100.0           root+superkingdom
2474.3          154.19           94.1            phylum+class+order
12831.92        747.52           94.5            family+genus+species
overall prec., all but unassigned: 94.7; all with unassigned: 94.7
PhyloPythiaS binning for FAMeS SimMC scenario (Nature Methods 2011)
[Plot residue: axes are taxonomic rank (unassigned to species), assigned sequences in kb (up to 9,000) and precision and recall in % (0 to 100); legend: false (kb), true (kb), macro precision α=0.99, macro recall.]
stdev
100.0 44.7 45.9 37.3 39.7 30.3 13.5 2.2 30.5 39.2
80
assigned sequences in kb
precision and recall in %
7,000
70
0
macro recall
1 1 3 5 19 50 93 135 43.7 38.4
90
8,000
80
unassigned superkingdom phylum class order family genus species avg/sum avg/sum
pred. bins
0.0 0.0 40.4 46.0 35.1 28.0 23.0 15.1 26.8 23.4
100 9,000
90
rank
stdev
CARMA binning for FAMeS SimMC scenario (Nature Methods 2011) using nucleotide-level alignment
100
0
unknown (kb) 0 0 0 0 0 0 0 0 0 0
macro precision α=0.99 100.0 100.0 48.1 38.2 23.6 13.7 6.8 2.5 33.3 41.6
(f) CARMA (nucleotide)
assigned sequences in kb
unassigned superkingdom phylum class order family genus species avg/sum avg/sum
depth
[Plot residue, Supplementary Figure S21 - Binning for FAMeS SimMC scenario (Nature Methods 2011): axes are taxonomic rank (unassigned to species), assigned sequences in kb (0 to 7,000 shown) and a percentage scale.]
rank
(e) MEGAN5 (amino acid)
false (kb) true (kb) macro precision α=0.99 macro recall
Supplementary Figure S21 - Binning for FAMeS SimMC scenario (Nature Methods 2011) rank unassigned superkingdom phylum class order family genus species avg/sum avg/sum
depth
true (kb)
false (kb)
0 1 2 3 4 5 6 7 4.4 0.9
5974.51 78.01 208.37 144 239.27 78.26 1.01 527.7 1276.62 7251.13
0 0 0 17.5 61.94 105.58 578.56 9025.58 9789.16 9789.16
unknown (kb) 0 0 0 0 0 0 0 0 0 0
macro precision α=0.99 100.0 100.0 45.2 36.6 13.2 8.2 4.3 2.1 29.9 38.7
(i) Kraken
stdev
pred. bins
macro recall
stdev
real bins
0.0 0.0 41.4 44.9 28.5 25.2 17.1 13.1 24.3 21.3
1 1 3 5 21 43 83 123 39.9 35.0
100.0 32.5 32.6 28.2 36.7 32.6 20.6 11.5 27.8 36.8
0.0 32.5 36.0 33.5 38.6 42.5 37.9 30.9 36.0 31.5
1 2 8 12 22 29 37 47 22.4 19.8
sum true (kb)   sum false (kb)   overall prec.   description
6130.53         0                100.0           root+superkingdom
591.64          79.44            88.2            phylum+class+order
606.97          9709.72          5.9             family+genus+species
overall prec., all but unassigned: 11.5; all with unassigned: 42.6
Kraken binning for FAMeS SimMC scenario (Nature Methods 2011)
[Plot residue: axes are taxonomic rank (unassigned to species), assigned sequences in kb (0 to 9,000) and precision and recall in % (0 to 100); legend: false (kb), true (kb), macro precision α=0.99, macro recall.]
Supplementary Figure S22 - Binning for partitioned cow rumen sample
consistent (kb)
inconsistent (kb)
unknown (kb)
0 1 2 3 4 5 6 7 1.8 1.0
144478 102968 25730 2256 13988 2400 5552 0 152894 297372
0 154 3872 1350 1810 964 1090 10670 19910 19910
0 0 22 54 42 104 132 890 1244 1244
macro consistency α=0.99 100.0 99.9 66.6 61.1 55.4 52.1 52.6 0.0 55.4 61.0
pred. bins
macro recall
stdev
cons. bins
stdev 0.0 0.0 20.9 18.7 17.1 23.2 36.6 0.0 16.6 14.5
pred. bins 1 1 13 28 62 167 572 1254 299.6 262.3
100.0 42.6 13.0 11.1 9.7 9.0 9.2 0.0 13.5 24.3
0.0 13.9 5.8 4.7 4.5 4.8 5.1 0.0 5.5 4.8
1 2 30 52 99 198 446 926 250.4 219.3
stdev
sum true (kb)   sum false (kb)   overall consist.   description
350414          154              100.0              root+superkingdom
41974           7032             85.7               phylum+class+order
7952            12724            38.5               family+genus+species
overall consist., all but unassigned: 88.5; all with unassigned: 93.7
rank unassigned superkingdom phylum class order family genus species avg/sum avg/sum
[Plot residue: axes are taxonomic rank, assigned sequences in kb (up to 160,000) and % macro-consistency; legend: unknown (kb), inconsistent (kb), consistent (kb), macro consistency α=0.99.]
inconsistent (kb)
unknown (kb)
0 1 2 3 4 5 6 7 3.0 0.9
192034 23020 17210 2620 19082 2190 7412 16336 87870 279904
0 66 2812 1202 3646 2166 4932 21246 36070 36070
0 0 12 6 14 82 216 2222 2552 2552
rank
depth
consistent (kb)
inconsistent (kb)
unknown (kb)
0 1 2 3 4 5 6 7 2.7 1.8
87760 65560 35352 2242 42082 2802 12764 31322 192124 279884
0 116 2802 1090 3308 3220 6726 18358 35620 35620
0 0 34 42 66 178 436 2266 3022 3022
1 2 30 52 104 221 578 1330 331.0 289.8
sum true (kb)   sum false (kb)   overall consist.   description
238074          66               100.0              root+superkingdom
38912           7660             83.6               phylum+class+order
25938           28344            47.8               family+genus+species
overall consist., all but unassigned: 70.9; all with unassigned: 88.6
[Plot residue: axes are taxonomic rank (unassigned to species), an assigned-sequence scale (up to 140,000) and a percentage scale; legend: unknown (kb), inconsistent (kb), consistent (kb), macro consistency α=0.99.]
(c) MEGAN4 (nucleotide)
pred. bins
macro recall
stdev
cons. bins
100.0 43.8 24.7 19.7 15.5 13.6 12.6 11.7 20.2 30.2
0.0 26.2 16.9 13.9 11.9 9.2 8.0 7.6 13.4 11.7
1 3 27 48 88 168 295 535 166.3 145.6
sum true (kb)   sum false (kb)   overall consist.   description
218880          116              99.9               root+superkingdom
79676           7200             91.7               phylum+class+order
46888           28304            62.4               family+genus+species
overall consist., all but unassigned: 84.4; all with unassigned: 88.7
Supplementary Figure S22 - Binning for partitioned cow rumen sample rank unassigned superkingdom phylum class order family genus species avg/sum avg/sum
depth
consistent (kb)
inconsistent (kb)
unknown (kb)
0 1 2 3 4 5 6 7 2.5 2.2
36062 90540 59686 6054 46384 4208 13258 24518 244648 280710
0 220 4074 1614 4808 4130 6756 12322 33924 33924
0 0 26 24 118 358 778 2588 3892 3892
macro consistency α=0.99 100.0 99.9 67.2 51.0 36.3 35.1 36.7 46.2 53.2 59.1
(d) MEGAN5 (amino acid)
pred. bins
macro recall
stdev
cons. bins
stdev 0.0 0.0 25.6 27.9 27.1 21.7 20.3 20.5 20.4 17.9
pred. bins 1 1 12 25 52 119 203 347 108.4 95.0
100.0 72.9 25.3 20.1 15.9 13.1 12.3 11.5 24.4 33.9
0.0 15.9 17.8 13.4 11.4 8.7 7.9 8.6 11.9 10.5
1 2 26 44 79 140 218 343 121.7 106.6
stdev
sum true (kb)   sum false (kb)   overall consist.   description
217142          220              99.9               root+superkingdom
112124          10496            91.4               phylum+class+order
41984           23208            64.4               family+genus+species
overall consist., all but unassigned: 87.8; all with unassigned: 89.2
MEGAN binning for partitioned cow rumen sample using protein-level alignment
[Plot residue: axes are taxonomic rank (unassigned to species), assigned sequences in kb (up to 160,000) and % macro-consistency (0 to 100); legend: unknown (kb), inconsistent (kb), consistent (kb), macro consistency α=0.99.]
cons. bins
0.0 7.1 6.0 4.5 4.3 4.1 4.7 5.0 5.1 4.5
120,000
MEGAN binning for partitioned cow rumen sample using nucleotide-level alignment
0
stdev
taxonomic rank
stdev 0.0 0.0 24.6 27.7 26.3 21.1 20.4 21.9 20.3 17.8
pred. bins 1 1 12 25 51 132 264 564 149.9 131.3
stdev
macro recall 100.0 33.2 15.4 13.9 12.3 11.0 11.0 10.4 15.3 25.9
70
0
Supplementary Figure S22 - Binning for partitioned cow rumen sample
unassigned superkingdom phylum class order family genus species avg/sum avg/sum
pred. bins
stdev 0.0 0.0 27.8 21.7 17.3 14.8 19.3 29.0 18.6 16.2
pred. bins 1 1 15 31 71 188 611 1956 410.4 359.3
stdev
160,000
taxonomic rank
macro consistency α=0.99 100.0 99.9 67.4 51.9 39.6 34.9 33.6 38.9 52.3 58.3
macro consistency α=0.99 100.0 99.9 47.1 39.6 31.9 31.6 31.1 31.0 44.6 51.5
100
100
unassigned superkingdom
consistent (kb)
CARMA binning for partitioned cow rumen sample using protein-level alignment
CARMA binning for partitioned cow rumen sample using nucleotide-level alignment
0
depth
(b) CARMA (amino acid)
assigned sequences in kb
unassigned superkingdom phylum class order family genus species avg/sum avg/sum
depth
[Plot residue, Supplementary Figure S22 - Binning for partitioned cow rumen sample: axes are taxonomic rank (unassigned to species), assigned sequences in kb (0 to 140,000 shown) and a percentage scale.]
rank
(a) CARMA (nucleotide)
unknown (kb) inconsistent (kb) consistent (kb) macro consistency α=0.99
Supplementary Figure S22 - Binning for partitioned cow rumen sample
rank                        unassigned  superkingdom    phylum     class     order    family     genus   species   avg/sum   avg/sum
depth                                0             1         2         3         4         5         6         7       1.9       1.2
consistent (kb)                 122146        115428     39828      1676     23582      2524      8938      1348    193324    315470
inconsistent (kb)                    0            62      1152       334       726       198       198        44      2714      2714
unknown (kb)                         0             0         4        28        28       100        94        88       342       342
macro consistency α=0.99         100.0         100.0      87.7      80.2      78.3      79.3      76.2      78.0      82.8      85.0
stdev                              0.0           0.0      16.9      17.0      20.2      19.8      35.9      37.4      21.0      18.4
pred. bins                           1             1         7        14        16        50       110       123      45.9      40.3
macro recall                     100.0          56.8      16.4      13.7      11.7      10.3       9.8       8.6      18.2      28.4
stdev                              0.0           7.3      12.3      11.3      10.8       8.8       7.8       6.7       9.3       8.1
cons. bins                           1             2        22        34        56        84        94       103      56.4      49.5

sum true (kb)  sum false (kb)  overall consist.  description
       353002              62             100.0  root+superkingdom
        65086            2212              96.7  phylum+class+order
        12810             440              96.7  family+genus+species
                                           98.6  all but unassigned
                                           99.1  all with unassigned
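The summary values above can be checked with a short calculation. The sketch below assumes that the overall consistency is the percentage of consistent kb among consistent plus inconsistent kb, with unknown kb left out. This definition is an assumption inferred from the numbers, not a formula stated here, and the function names are hypothetical. The per-rank values are copied from the consistent (kb) and inconsistent (kb) rows directly above.

```python
# Per-rank values (kb) for superkingdom .. species, copied from the
# "consistent (kb)" and "inconsistent (kb)" rows of the table above.
RANKS = ["superkingdom", "phylum", "class", "order", "family", "genus", "species"]
CONSISTENT_KB = [115428, 39828, 1676, 23582, 2524, 8938, 1348]
INCONSISTENT_KB = [62, 1152, 334, 726, 198, 198, 44]


def overall_consistency(true_kb, false_kb):
    """Assumed definition: percentage of true kb among true + false kb.

    Unknown kb are not part of the denominator (assumption).
    """
    total = true_kb + false_kb
    if total == 0:
        raise ValueError("no assigned sequence data")
    return 100.0 * true_kb / total


def group_consistency(ranks):
    """Sum true/false kb over the given ranks and return (true, false, percent)."""
    idx = [RANKS.index(r) for r in ranks]
    true_kb = sum(CONSISTENT_KB[i] for i in idx)
    false_kb = sum(INCONSISTENT_KB[i] for i in idx)
    return true_kb, false_kb, round(overall_consistency(true_kb, false_kb), 1)


if __name__ == "__main__":
    for group in (["phylum", "class", "order"],
                  ["family", "genus", "species"],
                  RANKS):
        print("+".join(group), group_consistency(group))
```

Under this assumption the sketch reproduces the summary rows for phylum+class+order (65086, 2212, 96.7), family+genus+species (12810, 440, 96.7) and the 98.6 reported for all ranks but unassigned.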
rank                        unassigned  superkingdom    phylum     class     order    family     genus   species   avg/sum   avg/sum
depth                                0             1         2         3         4         5         6         7       1.8       1.6
consistent (kb)                  36398        162352     63942      5920     34474      2586      8344      1136    278754    315152
inconsistent (kb)                    0           144      1604       410       600        92       228         8      3086      3086
unknown (kb)                         0             0         2        24        30        72        82        78       288       288
taxator-tk binning for partitioned cow rumen sample using nucleotide-level alignment
rank                        unassigned  superkingdom    phylum     class     order    family     genus   species   avg/sum   avg/sum
stdev                              0.0          14.5      15.7      13.1      12.4       9.4       8.1       7.8      11.6      10.1
cons. bins                           1             2        17        26        37        51        55        49      33.9      29.8

sum true (kb)  sum false (kb)  overall consist.  description
       361102             144             100.0  root+superkingdom
       104336            2614              97.6  phylum+class+order
        12066             328              97.4  family+genus+species
                                           98.9  all but unassigned
                                           99.0  all with unassigned
rank                        unassigned  superkingdom    phylum     class     order    family     genus   species   avg/sum   avg/sum
depth                                0             1         2         3         4         5         6         7       2.0       2.0
consistent (kb)                      0        148138     65136      9468     31734      3990      6144      3708    268318    268318
inconsistent (kb)                    0          2810     24220      9930      6490      1438      1078       538     46504     46504
unknown (kb)                         0             0         4       568       828       708      1072       524      3704      3704
macro consistency α=0.99         100.0         100.0      66.6      55.3      56.2      58.7      62.8      64.7      66.3      70.6
Supplementary Figure S22 - Binning for partitioned cow rumen sample
(g) PhyloPythiaS
rank                        unassigned  superkingdom    phylum     class     order    family     genus   species   avg/sum   avg/sum
stdev                              0.0           0.0      15.1      22.0      21.0      18.9      18.5      29.0      17.8      15.6
pred. bins                           1             1         4        10        19        30        33        64      23.0      20.3
macro recall                     100.0          81.6      31.0      21.0      12.8      10.4       9.2       8.0      24.9      34.3
stdev                              0.0          17.5      14.7      10.7       4.9       4.1       4.2       4.6       8.7       7.6
cons. bins                           1             2         7        13        25        39        45        67      28.3      24.9

sum true (kb)  sum false (kb)  overall consist.  description
       296276            2810              99.1  root+superkingdom
       106338           40640              72.3  phylum+class+order
        13842            3054              81.9  family+genus+species
                                           85.2  all but unassigned
                                           85.2  all with unassigned
Supplementary Figure S22 - Binning for partitioned cow rumen sample
rank                        unassigned  superkingdom    phylum     class     order    family     genus   species   avg/sum   avg/sum
depth                                0             1         2         3         4         5         6         7       3.9       0.2
consistent (kb)                 265616          2262       728       190      1116       254      1378     17642     23570    289186
inconsistent (kb)                    0             4       504       434       690       684      2736     23798     28850     28850
unknown (kb)                         0             0         0         2         8         8        34       438       490       490
macro consistency α=0.99         100.0          99.9      42.6      41.3      34.5      33.5      32.4      33.3      45.4      52.2
(h) Kraken
rank                        unassigned  superkingdom    phylum     class     order    family     genus   species   avg/sum   avg/sum
stdev                              0.0           0.0      25.6      22.9      18.4      15.3      19.5      28.2      18.6      16.2
pred. bins                           1             1        18        33        77       195       661      1953     419.7     367.4
macro recall                     100.0          21.0      10.8      10.5       9.9       9.3       9.7       9.9      11.6      22.6
stdev                              0.0           1.2       5.8       5.3       5.0       4.3       4.6       5.2       4.5       3.9
cons. bins                           1             2        30        52       110       233       640      1461     361.1     316.1

sum true (kb)  sum false (kb)  overall consist.  description
       270140               4             100.0  root+superkingdom
         2034            1628              55.5  phylum+class+order
        19274           27218              41.5  family+genus+species
                                           45.0  all but unassigned
                                           90.9  all with unassigned
Kraken binning for partitioned cow rumen sample
PhyloPythiaS binning for partitioned cow rumen sample
Supplementary Figure S22 - Binning for partitioned cow rumen sample
rank                        unassigned  superkingdom    phylum     class     order    family     genus   species   avg/sum   avg/sum
macro consistency α=0.99         100.0         100.0      94.7      88.6      89.3      78.6      81.2      78.7      87.3      88.9
stdev                              0.0           0.0       6.2      14.1      15.4      22.0      26.8      37.7      17.5      15.3
pred. bins                           1             1         5        10         9        32        27        59      20.4      18.0
macro recall                     100.0          74.1      20.9      16.3      14.0      11.8      10.5       9.8      22.5      32.2

taxator-tk binning for partitioned cow rumen sample using protein-level alignment
(e) taxator-tk (nucleotide)
(f) taxator-tk (amino acid)
[Bar charts: assigned sequences in kb (consistent, inconsistent, unknown) by taxonomic rank from unassigned to species, with % macro-consistency (α=0.99) on the secondary axis.]
Supplementary Figure S23: Parallel speedup of program taxator
[Plot: taxator assignment speedup (0 to 14) against the number of CPU cores (1, 2, 5, 10, 15, 20).]
Execution time analysis with taxator for parallelized processing with multiple CPU cores. Taxonomic placement of sequence segments with taxator on input alignments for sequences of length 1000 bp (syn1000 data set aligned against mRefSeq47 with LAST). The speedup was calculated from the wall clock time of a parallelized run relative to serial execution with one CPU thread: $S_p = T_1 / T_p$, with $T_1$: serial execution time and $T_p$: execution time using $p$ threads and CPU cores. With multiple threads, there is always one producer thread (consumer-producer model). Thus, for more than two threads, multiple consumers work on the input data in parallel. An approximately linear scale-up was observed up to 15 threads, and saturation effects appear when using 20 CPU cores on our system.
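As a small illustration of the speedup described in this caption (serial wall clock time divided by the wall clock time of the parallel run), the following sketch computes speedups from a table of timings. The function names and the timing values are hypothetical and chosen only for illustration; they are not measurements from this study.

```python
def speedup(t_serial, t_parallel):
    """Speedup of a parallel run relative to serial execution (wall clock)."""
    if t_serial <= 0 or t_parallel <= 0:
        raise ValueError("execution times must be positive")
    return t_serial / t_parallel


def speedup_table(wall_clock_by_threads):
    """Map each thread count p to its speedup relative to the 1-thread run."""
    t_serial = wall_clock_by_threads[1]
    return {p: speedup(t_serial, t)
            for p, t in sorted(wall_clock_by_threads.items())}


if __name__ == "__main__":
    # Hypothetical wall clock times in minutes, keyed by number of threads.
    timings = {1: 120.0, 5: 30.0, 10: 13.0, 20: 9.0}
    for p, s in speedup_table(timings).items():
        print(f"{p:>2} threads: speedup {s:.1f}")
```

Because one thread acts as producer whenever several threads are used, a run with two threads would be expected to show little gain over the serial run in such a table.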
Supplementary Figure S24: Effect of input sequence length and segmentation on taxator-tk processing time.
(a) taxator processing time with segmentation
(b) taxator processing time without segmentation
[Plots: time in min. wall clock (0 to 180) against individual sequence length in bp (100, 500, 1000), with one curve each for all reference, new species, new genus, new family, new order, new class and new phylum.]
We processed approximately the same number of sequences of length 100, 500 and 1000 bp with taxator-tk (syn100, syn500, syn1000), once with the segmentation procedure enabled (a) and once with segmentation disabled (b). In both cases, the run-time increases approximately linearly with the input length, with a slope that depends on the completeness of the reference sequence data. With all reference data available, the run-time increases more than linearly, as there is no segmentation of queries during computations. For all other cases, segmentation substantially decreases the execution time.
Supplementary Figure S25: Example GFF3 output of taxator
##gff-version 3
contig_0  taxator-tk  sequence_feature  102  121  1     .  .  seqlen=1012;tax=1224:19;ival=0.5
contig_0  taxator-tk  sequence_feature  155  194  0.91  .  .  seqlen=1012;tax=2:32-1;ival=0.8
contig_0  taxator-tk  sequence_feature  201  220  1     .  .  seqlen=1012;tax=40324:20-2;ival=0
contig_0  taxator-tk  sequence_feature  225  243  1     .  .  seqlen=1012;tax=316277:19-1;ival=1
contig_0  taxator-tk  sequence_feature  246  301  1     .  .  seqlen=1012;tax=731:38-1224;ival=0.72
contig_0  taxator-tk  sequence_feature  326  471  1     .  .  seqlen=1012;tax=338:87-1224;ival=0.98
contig_0  taxator-tk  sequence_feature  486  554  0.60  .  .  seqlen=1012;tax=1224:59-2;ival=0.67
contig_0  taxator-tk  sequence_feature  555  616  0.63  .  .  seqlen=1012;tax=32008:43-1224;ival=0.86
contig_0  taxator-tk  sequence_feature  633  651  1     .  .  seqlen=1012;tax=876:19-1;ival=1
contig_0  taxator-tk  sequence_feature  670  745  0.89  .  .  seqlen=1012;tax=31998:60-1;ival=0.89
contig_0  taxator-tk  sequence_feature  786  809  0.89  .  .  seqlen=1012;tax=256618:23-2;ival=0.2
contig_0  taxator-tk  sequence_feature  886  932  1     .  .  seqlen=1012;tax=644:33-2;ival=0.67
contig_0  taxator-tk  sequence_feature  958  980  1     .  .  seqlen=1012;tax=347:22-1;ival=1
Fields, left to right: Query identifier, Generator, Type, Begin, End, Score, Strand, Phase, and a last column with Query length (seqlen), Taxonomic range and support (tax) and Interpolation value (ival).
Query segment assignments calculated by the program taxator (version 1.1.1) are written in standard GFF3 format. Each tab-separated field holds the information named in the description at the bottom. The score measures the assignment quality and is under ongoing improvement. Strand and phase contain a dot as a placeholder because they do not apply to this output. The last column holds data in a key-value scheme and includes the query sequence length, a taxonomic prediction range of the form low:support-high, where low/specific (node X in Fig. 2a) and high/general (node R in Fig. 2a) are NCBI taxon IDs, and an interpolation value. The interpolation value ranges from zero (low) to one (high) and can be used to determine an approximate position in the given taxonomic range. As it might become necessary for post-processing applications such as whole sequence binning, more information can be added to the last column while preserving backward compatibility.
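For post-processing, one line of this output could be parsed along the following lines. This is a minimal sketch, not part of taxator-tk: the class and function names are hypothetical, the support value is read as a number, and a missing "-high" part (as in tax=1224:19) is represented as None, which is an assumption about how to treat that case.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SegmentAssignment:
    query: str
    begin: int
    end: int
    score: float
    seqlen: int
    low_taxid: int              # low/specific end of the taxonomic range
    support: float
    high_taxid: Optional[int]   # high/general end; None if not given (assumption)
    ival: float                 # interpolation value between 0 and 1


def parse_taxator_gff3_line(line: str) -> SegmentAssignment:
    """Parse one tab-separated record of the GFF3 output shown above."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 tab-separated fields, got {len(fields)}")
    query, _source, _type, begin, end, score, _strand, _phase, attributes = fields

    # Last column: semicolon-separated key=value pairs.
    attrs = dict(item.split("=", 1) for item in attributes.split(";") if item)

    # tax has the form low:support or low:support-high.
    low, rest = attrs["tax"].split(":", 1)
    support, sep, high = rest.partition("-")

    return SegmentAssignment(
        query=query,
        begin=int(begin),
        end=int(end),
        score=float(score),
        seqlen=int(attrs["seqlen"]),
        low_taxid=int(low),
        support=float(support),
        high_taxid=int(high) if sep else None,
        ival=float(attrs["ival"]),
    )


if __name__ == "__main__":
    record = ("contig_0\ttaxator-tk\tsequence_feature\t246\t301\t1\t.\t.\t"
              "seqlen=1012;tax=731:38-1224;ival=0.72")
    print(parse_taxator_gff3_line(record))
```

Unknown keys in the last column are simply ignored by this sketch, so it keeps working if further key-value pairs are appended later.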