Supplementary Methods for Taxator-tk: Precise ... - HZI OpenRepository

Supplementary Methods for Taxator-tk: Precise Taxonomic Assignment of Metagenomes by Fast Approximation of Evolutionary Neighborhoods

I. Taxonomic Assignment of Sequence Segments

Here we describe in detail the individual steps and the run-time properties of the algorithm implemented in the program taxator, the second stage of the overall binning workflow using taxator-tk (Fig. 2b). We propose the realignment placement algorithm (RPA) for the taxonomic assignment of a query segment q, which can be any subsequence of the full query sequence (i.e. the query can be a read, contig, scaffold or a complete genome sequence). The algorithm consists of two pairwise alignment passes, in each of which q is aligned to segments of nucleotide reference sequences. It aims at identifying as many taxa of the prediction clade (node R in Fig. 2a) as possible without explicitly resolving its phylogenetic structure.

1. Among the given set of homologous segments, constructed from overlapping alignments before application of the RPA, we define s to be the segment most similar to q, i.e. the one with the best local alignment score of all reference segments. In the first pass, all segments are aligned against s (n alignments, where n is the number of homologous segments). The resulting pairwise scores (our implementation uses the edit distance, i.e. mismatches + gaps) define an ordering among all segments or their corresponding taxa. The distinction between segments and associated taxa will be neglected in the following for better readability. All taxa which are less distant to s than q, including s itself, are added to an initially empty set M, which holds all identified taxa of the prediction clade. The first taxon that is more distant to s than q is defined to be the outgroup segment o (Fig. 2c). It is used as the alignment target in the second and last pass, in which taxa similar to o are added to M.

2. We align all segments, including q, against o and rank the resulting scores. Then we add to M all taxa which have a lower score than q. After some fine-tuning, we chose to also add taxa with a higher score than q within a small range accounting for erroneous scores, because o and q can be very distant homologs with a noisy alignment.
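
The two passes described above can be illustrated with a minimal Python sketch. It is not the taxator implementation: the function names, the example sequences and the lineages are hypothetical. The most similar reference s is chosen by smallest distance instead of best local alignment score, ties are not repeated per segment, taxa at exactly the same distance as q are counted as inside, and the error band is a fixed number `band` rather than the per-segment value derived from taxonomic disorder.

```python
from typing import Callable, Dict, Iterable, Optional, Sequence, Set


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance (mismatches + gaps) by dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # gap
                           cur[j - 1] + 1,             # gap
                           prev[j - 1] + (ca != cb)))  # match or mismatch
        prev = cur
    return prev[-1]


def rpa(query: str, refs: Dict[str, str],
        dist: Callable[[str, str], int] = edit_distance,
        band: int = 0) -> Optional[Set[str]]:
    """Return the taxon set M for a query segment, or None without outgroup.

    refs maps a (hypothetical) taxon name to its homologous segment.
    """
    if not refs:
        return None
    # s: most similar reference (assumption: smallest distance to q).
    s = min(refs, key=lambda t: dist(query, refs[t]))
    d_q = dist(query, refs[s])

    # Pass 1: align all segments against s.
    d1 = {t: dist(seg, refs[s]) for t, seg in refs.items()}
    members = {t for t, d in d1.items() if d <= d_q}  # ties count as inside
    farther = [t for t in refs if d1[t] > d_q]
    if not farther:
        return None  # no outgroup: the segment stays unassigned
    o = min(farther, key=lambda t: d1[t])  # first taxon more distant than q

    # Pass 2: align all segments, including q, against o.
    d2_q = dist(query, refs[o])
    members |= {t for t, seg in refs.items()
                if dist(seg, refs[o]) <= d2_q + band}
    return members


def lca(taxa: Iterable[str], lineage: Dict[str, Sequence[str]]) -> Optional[str]:
    """Lowest common ancestor, given root-to-taxon lineages."""
    common = None
    for nodes in zip(*(lineage[t] for t in taxa)):
        if len(set(nodes)) != 1:
            break
        common = nodes[0]
    return common


# Hypothetical toy data: four reference segments and one query segment.
QUERY = "CCAGAAAAAAAA"
REFS = {
    "sp1": "CCCAAAAAAAAA",
    "sp1b": "CCCAAAAAAAAT",
    "out": "CAAATTAAAAAA",
    "far": "AAAAAAAAGGGG",
}
LINEAGE = {
    "sp1": ["root", "FamA", "GenX", "sp1"],
    "sp1b": ["root", "FamA", "GenX", "sp1b"],
    "out": ["root", "FamA", "GenY", "out"],
    "far": ["root", "FamB", "GenZ", "far"],
}
```

Calling `rpa(QUERY, REFS)` and passing the result to `lca(..., LINEAGE)` yields the assignment for the toy segment. Any distance function with the same signature can be supplied through `dist`.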
The width of this error band is determined on a per-segment basis as a linear score function of the taxonomic disorder in the alignment scores and not a universal or configurable run-time parameter. We interpret a rank disorder (e.g. a known family member of o being more similar to o than a

taxator-tk Supplementary Material Page 1 of 51

corresponding species member segment) as a discordance between gene tree and taxonomy and proportionally scale the effective score of q to enlarge M by taxa which are slightly more distant to o than q. This second pass requires n new alignments, or fewer if some segments are identical to either q or s. If multiple best references (s) or outgroup segments (o) with identical alignment scores are present in these two passes, the calculations are repeated for every such segment in order to produce stable output. We reduced the additional computational effort in our implementation by detecting frequent identical segments and uninformative homologs. The final assignment taxon ID of q is the lowest common ancestor (LCA) of the taxa in M, or none if no outgroup has been found.

The theoretical run-time of the segment assignment algorithm, measured in units of "number of pairwise alignments", is in [...] and about [...], where n denotes the number of homologous segments. The run-time complexity for a single pairwise alignment is O(l^2), i.e. it scales quadratically with the segment length l. Therefore the total run-time complexity per segment is [...] and the total worst-case run-time for the entire query sequence can be bounded above by [...], where [...] denotes the maximum number of segment homologs among all query segments and [...] is the total length of the query sequence. Thus, the run-time for the entire sample in the worst case scales linearly with the amount of sequence data (bp) and linearly with the number of homologs, but quadratically with the length of the individual segments. Segments with an excessive number of homologs, most often short segments of abundant and uninformative regions, have a negative impact on the program run-time. By default, our pipeline scripts currently limit the number of homologs per query to the 50 top-scoring ones before passing them to taxator (a configurable run-time parameter in the program alignments-filter or directly in the local alignment search program). Other tested values gave similar results, and the parameter, if changed, should be chosen based on hardware limitations. If this parameter is set too low, the number of reference segments drops below a critical value, so that no outgroup can be determined for some q, which therefore remain unassigned (but without impacting the taxon ID of other segments).

II. Consensus Binning Algorithm

Because taxator, in stage two of the workflow (Fig. 1b), assigns taxa only to sparse segments of a query, a final processing step (Fig. 1c) is required to determine a taxon ID for the entire query sequence. We have therefore implemented a simple, weighted consensus assignment scheme in the program binner, which optionally permits custom constraints, e.g. a minimum percentage identity (PID) for classification at the species level or the removal of taxa with low counts in the whole sample. However, there are currently only two mandatory run-time parameters which control the actual post-processing consensus algorithm. First, we define the support of a query segment to be the total number of positions identical to the best reference segment. The first run-time parameter specifies the minimum combined support at any rank (50 positions by default) and serves to ignore false predictions caused by short and often noisy segments. The second parameter specifies the minimum percentage of the summed support (70% by default) which allows a majority taxon to outvote a contradicting minority. Inconsistent taxa below this support value are resolved by the LCA operation until the threshold is reached. Probably due to the conservative nature of the RPA, we found these two parameters to have minimal impact on the binning results in practice. The output of taxator additionally includes the taxa in the evolutionary neighborhood, a score reflecting the agreement between the segment tree and the taxonomy, as well as a score for interpolation of the query-branch location between the R and X nodes of Fig. 2. We provide Python language bindings for processing with other applications.

III. Taxonomy and Phylogeny

Taxator-tk assumes that the NCBI taxonomy used for the assignment correctly captures the evolutionary process of speciation, although we know that the categorization of some taxa might be inconsistent with their evolution. If the phylogenetic information inferred from similarity scores disagrees with the taxonomic structure, assignments are made to a consistent higher rank. For instance, horizontal gene transfer and upstream sequence misassembly can cause multiple similar copies of a sequence to be distributed across unrelated taxa. If the algorithm cannot trace a query sequence to have evolved with either copy, it is usually assigned to the LCA of these clades. However, if the donor clade is unknown, the query may also be assigned to the recipient clade and the horizontal transfer or misassembly can go undetected. Thus assignment errors caused by the evolution of genes, upstream technical errors or the taxonomy cannot always be eliminated in this framework. It remains to be assessed whether the use of an alternative microbial taxonomy such as the GreenGenes [1] or the SILVA [2] taxonomy would improve on the taxonomic assignment.

IV. Comparison and Innovations

Taxator-tk shares some ideas with previous programs, starting with MEGAN [3], which uses local alignment scores to define a "neighborhood of related sequences" and then makes a taxonomic estimate which is the LCA of the corresponding taxa. This neighborhood threshold is a percentage of the local alignment score and can be interpreted to reflect the rate of evolution within a taxonomic group. Its value is empirical and lacks stronger justification. The neighborhood definition has been improved in taxator-tk and other programs. To our knowledge, SOrt-ITEMS [4] was the first algorithm to use the logic of realignment to the best reference (termed reciprocal similarity) for read assignment, but it is restricted to protein-level alignment and is implemented as a wrapper around (the legacy C version of) BLAST+3. Protein-level alignment in general triples the run-time of the local alignment step (translation into three reading frames) and cannot make use of faster nucleotide aligners. SOrt-ITEMS also uses fixed similarity thresholds in terms of percentage identity to define universal levels of conservation within taxonomic groups, assuming the same rate of evolution for different genetic regions and clades. Furthermore, SOrt-ITEMS was primarily designed for reads and, even if it performs well for longer sequences, its run-time is expected to increase proportionally with input sequence length.

Both follow-up programs, taxator-tk and CARMA3 [6], adopted the logic of reciprocal alignment, extended it and removed the assumption of universal conservation levels. CARMA3 accounts for a heterogeneous rate of evolution for different genetic regions. The initial identification of similar sequences in the reference can be based on a nucleotide or protein BLAST search or on profile hidden Markov models with HMMER [7].
In BLAST mode, CARMA3, like SOrt-ITEMS, uses a single reciprocal alignment search and then extrapolates or interpolates alignment scores to select a taxonomic rank for prediction. It therefore assumes a parameterized model for the conservation level at a taxonomic rank: a linear function which is fitted to the observed local alignment scores. With taxator-tk, we instead use a non-parametric score ranking algorithm. Also, to our knowledge, we provide the first algorithm to determine a proper outgroup and to sparsify the input data, which makes it possible to assign distinct regions of the query sequence to possibly different taxonomic groups. Also, we at most assume segment-wise constant rates of evolution (equally long branches from a common ancestor). This makes the major algorithmic component parameter-free and robust in itself, independent of the individual segment sizes. Through the sparsification procedure it accounts for structural rearrangements among distant relatives and scales better with the length of the input sequences. The individual segment assignments allow for a robust consensus voting scheme for the assignment of entire sequence fragments. The segment-specific classifications could also be used to detect an inconsistent taxonomic composition of an input sequence, which can be caused by horizontal gene transfer events (HGTs) and assembly errors. Unlike most previous approaches, taxator-tk was developed for and tested with fast nucleotide sequence local alignments instead of protein sequence alignments, although both can be used for the local alignments in stage 1 of the workflow. Our comparisons, however, suggest that the additional computations required for protein-level homology search do not considerably improve the results with taxator-tk. Thus, taxonomic binning of a metagenome sample with taxator-tk requires no more than the specification of reference sequences, their taxonomic affiliations and an aligner like BLAST or LAST [8]. On the implementation side, all workflow steps for taxonomic assignment with taxator-tk are designed in a modular way, making it easy to save, compress, reuse or recompute results. The computation-intensive classification of segments in taxator runs in parallel on many CPU cores and uses the open source C++ algorithm library SeqAn [9] for fast pairwise alignment.

V. Performance Measures

As metagenome datasets can vary in taxonomic composition, in terms of which taxa are present and their relative abundances, this needs to be taken into consideration when evaluating taxonomic assignment methods. If an algorithm performs better for some clades than for others at a given rank, we call it taxonomically biased. A classifier is often biased if it uses parameters that fit one clade better than another. This can be the case if the parameters were chosen to give good overall assignment accuracy (a low total number of false predictions) on training data with biased taxonomic composition. Such a method is optimized to perform well for the abundant taxa of these particular training data and will not generalize well when applied to a sample of different taxonomic structure and abundances. To account for uneven taxonomic composition in evaluation datasets and to obtain comparable performance estimates across datasets of different taxonomic composition, we used as the primary evaluation measure the bin-averaged precision (or positive predictive value), also known as macro-precision:

$$\text{macro-precision} = \frac{1}{N_p} \sum_{i=1}^{N_p} \text{precision}_i \qquad \text{(Equation V.1)}$$

where $N_p$ is the number of all predicted bins and

$$\text{precision}_i = \frac{TP_i}{TP_i + FP_i} \qquad \text{(Equation V.2)}$$

True positives $TP_i$ are the correct assignments to the $i$-th bin and false positives $FP_i$ the incorrect assignments to the same bin. The macro-precision is the fraction of correct sequence assignments over all assignments to a given taxonomic bin, averaged over all predicted bins for a given rank. For falsely predicted bins which do not occur in the data, the precision is therefore zero. This value reflects how trustworthy the bin assignments are on average from a user's perspective, as it is averaged over all predicted bins. In addition to the macro-precision, we report the raw numbers of true and false predictions for every cross-validation, as well as a quick overall precision for pooled ranks. This overall precision is most informative for species+genus+family and reports the fraction of true classifications among the predictions for all these ranks in a single pooled bin:

$$\text{overall precision} = \frac{\sum_{r} TP_r}{\sum_{r} \left( TP_r + FP_r \right)} \qquad \text{(Equation V.3)}$$

where $r$ runs over the pooled ranks.

We measure the taxonomic bias of a method in terms of the standard deviation over all individual bin precisions (Equations V.4 and V.5). The standard deviation is small if all predicted bins have a similar precision. A universally good method should have a high macro-precision with a low taxonomic bias. The recall (or sensitivity) is a measure of the completeness of a predicted bin and, analogously, the macro-recall is the fraction of correctly assigned sequences among all sequences belonging to a certain bin, averaged over all existing bins in the test data [10]:

$$\text{macro-recall} = \frac{1}{N_r} \sum_{i=1}^{N_r} \text{recall}_i \qquad \text{(Equation V.6)}$$

where $N_r$ is the number of all existing bins in the test data and

$$\text{recall}_i = \frac{TP_i}{TP_i + FN_i} \qquad \text{(Equation V.7)}$$

False negatives ($FN_i$) are the assignments which belong to the $i$-th bin but were classified to another bin or left unassigned. The macro-recall reflects how well the classifier works more from a developer's perspective than from the user's perspective, as it is usually not known which predicted bins correspond to existing ones and which do not.

VI. Low-abundance Filtering

The number of predicted bins at each rank can be quite large, at most the number of known taxa in the taxonomy and reference sequence data. When noise is considered to be evenly distributed across this large output space, bins with few assigned sequences are more likely to be falsely identified than larger bins (the chance to independently classify the same bin by chance n times is [...], where [...] is the number of possible bins). Since the macro-precision is an average over all predicted bins, it is heavily affected by bins with few sequences assigned. As a result, classifiers that predict clades present at low frequencies in the sample score badly under this measure. To correct for this effect, we define a truncated average precision which ignores the least abundant predicted bins and considers only the largest predicted bins constituting a minimum fraction α of the total assignments (bins of equal size are also included). This modification acts as a noise filter and accounts for different behavior of classifiers without explicitly considering the size of the model space or the number of existing species in the actual sample. We set α to 0.99 for our evaluations.
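
The measures of Sections V and VI can be made concrete with a short Python sketch. It is an illustration under stated assumptions, not the evaluation code used for this study: all function names and the toy data are hypothetical, the standard deviation is computed in its population form (the normalization used in the study is not stated here), and "equal size bins are also included" is read as keeping every bin that ties in size with the smallest retained bin.

```python
from collections import Counter
from statistics import mean, pstdev
from typing import Dict, List, Optional, Sequence, Tuple

# (true bin, predicted bin); a predicted bin of None means "unassigned".
Pair = Tuple[str, Optional[str]]


def bin_precisions(pairs: Sequence[Pair]) -> Dict[str, float]:
    """Precision TP / (TP + FP) for every predicted bin."""
    tp, size = Counter(), Counter()
    for truth, pred in pairs:
        if pred is not None:
            size[pred] += 1
            tp[pred] += int(pred == truth)
    return {b: tp[b] / size[b] for b in size}


def largest_bins(pairs: Sequence[Pair], alpha: float) -> List[str]:
    """Largest predicted bins covering at least a fraction alpha of all
    assignments; bins tied in size with the last kept bin are kept too."""
    size = Counter(pred for _, pred in pairs if pred is not None)
    total = sum(size.values())
    kept, covered, last = [], 0, None
    for b, n in size.most_common():
        if covered >= alpha * total and n != last:
            break
        kept.append(b)
        covered += n
        last = n
    return kept


def macro_precision(pairs: Sequence[Pair], alpha: float = 1.0) -> Tuple[float, float]:
    """Truncated macro-precision and its standard deviation (the bias).
    Population standard deviation is an assumption of this sketch."""
    prec = bin_precisions(pairs)
    values = [prec[b] for b in largest_bins(pairs, alpha)]
    return mean(values), pstdev(values)


def macro_recall(pairs: Sequence[Pair]) -> float:
    """Recall TP / (TP + FN), averaged over all existing bins."""
    tp, size = Counter(), Counter()
    for truth, pred in pairs:
        size[truth] += 1
        tp[truth] += int(pred == truth)
    return mean(tp[b] / size[b] for b in size)


def overall_precision(pairs: Sequence[Pair]) -> float:
    """Fraction of correct assignments among all assignments (pooled bin)."""
    assigned = [(t, p) for t, p in pairs if p is not None]
    return sum(t == p for t, p in assigned) / len(assigned)


# Hypothetical toy data: bin C is predicted but does not exist in the data.
PAIRS: List[Pair] = ([("A", "A")] * 8 + [("A", "C"), ("A", None)]
                     + [("B", "B")] * 3 + [("B", "A")])
```

With the toy data, the single false assignment to the non-existent bin C pulls the untruncated macro-precision down sharply, while a truncation with `alpha=0.9` ignores that small bin. This is the effect the low-abundance filter is meant to dampen.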

VII. Cross-validation

Despite the limitations of simulated metagenomes, which incorporate assumptions about sequencing error rates or species abundance distributions, it is very informative to evaluate taxonomic assignment methods on simulated sequence data, as real metagenome samples lack taxon IDs for evaluation. Our canonical way of evaluating a method on simulated data is a version of leave-one-out cross-validation: each query sequence is classified after removing all identical or related sequences up to a given rank from the reference collection. For example, to assess the performance in assigning query sequences from a new species, all sequences belonging to this species are removed from the reference sequence collection for the classifier. Performance measures (macro-recall, macro-precision), along with other statistics (true/false/unassigned data, overall precision, bin counts) which are available in the accompanying tables, were normally calculated in units of the number of assigned base pairs, or of the number of assigned sequences if these had comparable lengths. These values were calculated for all ranks (species, genus, family, order, class, phylum, domain/superkingdom) in seven simulations: either all reference data was used (per query), or all data from the query species, genus, family, order, class or phylum was removed from the reference data prior to classification. The assignments of these seven cross-validation experiments were averaged for a combined performance summary with standard measures.

VIII. Consistency Analysis

In order to evaluate the predictions for real metagenome samples, where no underlying correct taxon IDs are known for the sequences, we assigned sequences linked by assembly and calculated an assignment consistency value. We split long contigs into multiple pieces and classified each piece independently. Assuming that the sequence assembly was correct in the first place, contradicting assignments of pieces that originate from the same contig represent false assignments.
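
One way to turn this idea into a number is sketched below in Python. The exact definition of the consistency value is not given in this section, so the sketch rests on assumptions: assignments are treated as contradicting when they do not lie on a single root-to-leaf path of the taxonomy, unassigned pieces are compatible with everything, and the value is the fraction of contigs without contradiction. All names are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

# Root-to-taxon path of an assignment; the empty tuple means "unassigned".
Lineage = Tuple[str, ...]


def split_contig(seq: str, piece_len: int) -> List[str]:
    """Cut a contig into consecutive pieces of at most piece_len bases."""
    return [seq[i:i + piece_len] for i in range(0, len(seq), piece_len)]


def is_consistent(assignments: Sequence[Lineage]) -> bool:
    """True if all piece assignments lie on one root-to-leaf path."""
    deepest = max(assignments, key=len, default=())
    return all(deepest[:len(a)] == a for a in assignments)


def consistency(by_contig: Dict[str, Sequence[Lineage]]) -> float:
    """Fraction of contigs whose pieces were assigned without contradiction."""
    flags = [is_consistent(a) for a in by_contig.values()]
    return sum(flags) / len(flags)


# Hypothetical piece assignments for two contigs.
GAMMA = ("Bacteria", "Proteobacteria", "Gammaproteobacteria")
PROTEO = ("Bacteria", "Proteobacteria")
FIRMI = ("Bacteria", "Firmicutes")
EXAMPLE = {"contig1": [GAMMA, PROTEO, ()], "contig2": [GAMMA, FIRMI]}
```

In this reading, an assignment at a higher rank of the same lineage is not a contradiction, which matches the conservative behavior of assigning to higher ranks when in doubt.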
This unveils part of the errors made by a particular method, but some, if not the majority, will go undetected because the actual ID stays unknown and the assignments for a contig can be consistently wrong. Hence these results are generally more difficult to interpret than those from simulated data.

IX. Sequence Homology Search via Local Alignment

In the course of evaluation we created many local alignments as input to the taxonomic assignment programs CARMA3, MEGAN4/5 and taxator-tk. The nucleotide alignments were mostly generated using the alignment program LAST (version 320) because it ran faster than BLAST+/blastn (version 2.2.28+) without noticeable differences in the output alignments. The protein-level alignments which we used in our evaluations were generated with BLAST+/tblastx (version 2.2.28+) because we wanted to compare with identical nucleotide reference sequences. We support and tested different alignment programs because BLAST is standard and easy to parallelize, whereas LAST has a faster algorithm but high memory requirements. It ran with comparable speed to the BLAST+/megablast algorithm, which has a limited sensitivity and in practice resulted in two to four times fewer query sequences being aligned and classified. For a detailed comparison of alignment programs and how LAST compares to other programs such as RAPSEARCH2 [11] and BLAT [12], consider Niu et al. [13] and Darling et al. [14]. In our evaluations, LAST was roughly 50 to 200 times faster than BLAST+/blastn and about as fast as BLAST+/megablast (which has much reduced sensitivity). LAST is also tunable for better sensitivity with protein-coding nucleotide sequences using a special form of seeding. If other alignment programs are found to be better suited for a particular data type, these can easily be incorporated into the provided workflows. For instance, local protein sequence alignments can be performed in the homology search step, e.g. by using BLAST+/tblastx. There are fast aligners such as RAPSEARCH2, PAUDA [15] and DIAMOND [16] that allow searching for homologs in large reference collections of amino acid sequences. To produce compatible input for taxator-tk, the amino acid alignment positions must be converted into nucleotide positions. For our short sequence length evaluation (Supplementary Fig. S6-S8), the evaluation of a published SimMC scenario (Supplementary Fig. S21) and the evaluation of a simulated metagenome sample with 49 species (Fig. 3, Supplementary Fig. S11-S13), we used a standard BLAST+/blastn (version 2.2.28+) and BLAST+/tblastx search. We chose the default alignment parameters and scoring schemes with each aligner. The generated alignments were then provided in BLAST tabular format to be usable with CARMA3 and MEGAN4/MEGAN5. Taxator-tk reads a simple tab-separated alignment format that can be generated directly with BLAST+ or with conversion scripts which we provide for the MAF alignment format of LAST. This arrangement ensures that taxator-tk can be easily adapted to profit from advancements in the field of local alignment in the future. Users can also employ amino acid level alignment if the final output is mapped back to positions on the nucleotide reference and query sequences. The easiest way to achieve this is to use BLAST+/tblastx, although this is computationally more demanding than directly searching a collection of protein sequences for which nucleotide sequences are also available.

X. Program Parameters and Versions

For taxonomic assignment with MEGAN4 (version 4.70.4) we used the parameters minscore=20, toppercent=20, minsupport=5 and mincomplexity=0.44. With MEGAN5 (version 5.4.3), we used the default options minsupport=10, minscore=50, max_expected=0.01, minimal_coverage_heuristic=on and, as with MEGAN4, top_percent=20. In CARMA3, we used the standard parameters in the contained configuration file. Kraken (version 0.10.4b) was also applied with the standard commands and without shrinking the database (shrink_db.sh). Taxator-tk (version 1.1.1-extended) was run with standard settings, restricted to the 50 best-scoring local alignments to avoid long run-times for some of the query sequences. This is purely a convenience filter at the current state of development and is meant to be replaced by an adaptive per-segment heuristic.

XI. 16S Cross-validation

We evaluated the performance of taxator-tk in classifying the most widely used taxonomic marker gene in studies of microbial diversity, the 16S rRNA gene, as a proof of concept. For our evaluation, we extracted 7,175 annotated 16S rRNA genes (Suppl. Fig. 5), each with a minimum length of 1 kb, from mRefSeq47 (Suppl. Fig. 9). The sequences were assigned with taxator-tk using the entire mRefSeq as reference, not just 16S genes. The cross-validation assesses the performance of 16S gene assignment in a wide range of situations. The performance statistics were calculated based on the number of assigned sequences, as all have comparable length. When using the complete reference sequences, 87% of sequences were assigned to the ranks of species, genus and family with 100% accuracy (Supplementary Fig. S3b); the remaining 13% were correctly assigned at higher ranks. This is an ideal situation showing the baseline on our dataset (in terms of the assigned rank depth). In more realistic simulations, when we tested assignment of genes from novel species or novel higher-level clades, assignments were accordingly made to higher ranks in most cases. For instance, when simulating novel species, 2,678 contigs were assigned to the correct genera, while 491 erroneous species and genus assignments were made. The macro-precision in the combined cross-validation (Fig. 2) was always above 92%, with standard deviations from 10 to 25%, which demonstrates a good and even performance of taxator-tk for all clades in the case of 16S rRNA data.

XII. FAMeS Cross-validation

On the FAMeS contig datasets, taxator-tk produced fewer errors than MEGAN4 for all taxonomic ranks, accompanied by a moderate reduction in macro-recall, throughout all individual experiments and in the combined cross-validation experiments. For SimMC, the macro-precision was three to four times as large as MEGAN4's for species to order, with higher macro-recall (Supplementary Fig. S17-S18). The species to family overall precision was ~91% for taxator-tk (~59% for MEGAN4) and taxator-tk estimated 54 species bins (MEGAN4: 188) for the 47 actual species in SimMC. Similarly, for SimHC, taxator-tk achieved a higher macro-precision for all ranks, which was most pronounced for class and phylum (Supplementary Fig. S19-S20). By contrast, the macro-recall was slightly reduced and both methods underestimated the 96 existing species in SimHC.

XIII. Supplementary Files

The PDF attachment includes informative interactive charts and the files which are necessary to reproduce the results shown in the article. Larger benchmark data can be downloaded from http://algbio.cs.uni-duesseldorf.de/software/.

Supplementary Methods References

1. DeSantis, T. Z. et al. Greengenes, a chimera-checked 16S rRNA gene database and workbench compatible with ARB. Appl. Environ. Microbiol. 72, 5069–72 (2006).
2. Quast, C. et al. The SILVA ribosomal RNA gene database project: improved data processing and web-based tools. Nucleic Acids Res. 41, D590–6 (2013).
3. Huson, D. H., Auch, A. F., Qi, J. & Schuster, S. C. MEGAN analysis of metagenomic data. Genome Res. 17, 377–86 (2007).
4. Monzoorul Haque, M., Ghosh, T. S., Komanduri, D. & Mande, S. S. SOrt-ITEMS: Sequence orthology based approach for improved taxonomic estimation of metagenomic sequences. Bioinformatics 25, 1722–30 (2009).
5. Camacho, C. et al. BLAST+: architecture and applications. BMC Bioinformatics 10, 421 (2009).

6. Gerlach, W. & Stoye, J. Taxonomic classification of metagenomic shotgun sequences with CARMA3. Nucleic Acids Res. 1–11 (2011).
7. Finn, R. D., Clements, J. & Eddy, S. R. HMMER web server: interactive sequence similarity searching. Nucleic Acids Res. 39 Suppl 2, W29–37 (2011).
8. Frith, M. C., Hamada, M. & Horton, P. Parameters for accurate genome alignment. BMC Bioinformatics 11, 80 (2010).
9. SeqAn. at http://www.seqan.de
10. McHardy, A. C., Martín, H. G., Tsirigos, A., Hugenholtz, P. & Rigoutsos, I. Accurate phylogenetic classification of variable-length DNA fragments. Nat. Methods 4, 63–72 (2007).
11. Zhao, Y., Tang, H. & Ye, Y. RAPSearch2: a fast and memory-efficient protein similarity search tool for next generation sequencing data. Bioinformatics 28, 125–126 (2011).
12. Kent, W. J. BLAT--the BLAST-like alignment tool. Genome Res. 12, 656–64 (2002).
13. Niu, B., Zhu, Z., Fu, L., Wu, S. & Li, W. FR-HIT, a very fast program to recruit metagenomic reads to homologous reference genomes. Bioinformatics 27, 1704–5 (2011).
14. Darling, A. E. et al. PhyloSift: phylogenetic analysis of genomes and metagenomes. PeerJ 2, e243 (2014).
15. Huson, D. H. & Xie, C. A poor man's BLASTX--high-throughput metagenomic protein database search using PAUDA. Bioinformatics 30, 38–9 (2014).
16. Buchfink, B., Xie, C. & Huson, D. H. Fast and Sensitive Protein Alignment using DIAMOND, under review.

Supplementary Figure S1: Query sequence segmentation and segment splicing

Query and corresponding reference segments from local alignment region extension and splicing. Blue bars correspond to original local alignment regions on reference nucleotide sequences which are positionally aligned to the query nucleotide sequence in red. These alignments are generated by a local (nucleotide) sequence aligner such as BLAST or LAST before running taxator. If alignments overlap on the query, they are joined into query segments which are flanked by regions without detected similarity to any known reference sequence. Reference segments are constructed from the original alignment reference regions (blue) by extension (gray bars) with the same number of nucleotides which are missing to match the length of the query segment. The corresponding sets of homologs are the input to the core taxonomic assignment algorithm in taxator.
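
The segmentation and extension steps in this caption can be sketched in a few lines of Python. This is a simplified illustration, not the taxator code: the `Alignment` record and both functions are hypothetical, coordinates are half-open and on the forward strand, alignment gaps are ignored, alignments that merely touch are not joined, and extended reference regions are clipped at the ends of the reference sequence (an assumption of this sketch).

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class Alignment:
    ref_id: str
    q_start: int  # alignment region on the query, half-open [q_start, q_end)
    q_end: int
    r_start: int  # alignment region on the reference, half-open
    r_end: int


def query_segments(alignments: Iterable[Alignment]) -> List[Tuple[int, int, List[Alignment]]]:
    """Join alignments that overlap on the query into query segments."""
    segments: List[Tuple[int, int, List[Alignment]]] = []
    for aln in sorted(alignments, key=lambda a: a.q_start):
        if segments and aln.q_start < segments[-1][1]:
            start, end, members = segments[-1]
            segments[-1] = (start, max(end, aln.q_end), members + [aln])
        else:
            segments.append((aln.q_start, aln.q_end, [aln]))
    return segments


def reference_segment(aln: Alignment, seg_start: int, seg_end: int,
                      ref_len: int) -> Tuple[int, int]:
    """Extend the reference region by the nucleotides missing on each side
    to match the query segment, clipped to the reference boundaries."""
    left = aln.q_start - seg_start
    right = seg_end - aln.q_end
    return max(0, aln.r_start - left), min(ref_len, aln.r_end + right)


# Hypothetical alignments of one query against three references.
A = Alignment("ref1", 10, 50, 100, 140)
B = Alignment("ref2", 40, 90, 5, 55)
C = Alignment("ref3", 200, 260, 0, 60)
SEGMENTS = query_segments([C, A, B])
```

Each resulting query segment, together with the extended reference segments of its member alignments, forms one set of homologs of the kind passed to the assignment algorithm.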

Supplementary Figure S2: Taxonomic assignment of segments

Color legend: unassigned, class, family, genus.

Four long contigs of the SimMC dataset. Colored boxes show segments that were assigned by taxator when all species reference data was removed (new species simulation). White regions in between lack alignments from the local alignment search and therefore have no homologs for assignment. All assigned regions in this example are consistently assigned at the taxonomic ranks genus, family and class. The shown segments are used by the program binner to derive consistent whole-sequence taxonomic assignments, as done in our evaluations.
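
How segment assignments like these could be combined into one whole-sequence assignment is sketched below, following the two parameters described in Section II (minimum support of 50 positions, 70% majority). This is one possible reading and not the binner implementation: a segment is assumed to support its assigned taxon and all of that taxon's ancestors, and the consensus is the deepest taxon whose accumulated support reaches both thresholds. The function name and example lineages are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Optional, Sequence, Tuple

Lineage = Tuple[str, ...]  # root-to-taxon path of a segment assignment


def consensus(segments: Sequence[Tuple[Lineage, int]],
              min_support: int = 50,
              min_fraction: float = 0.7) -> Optional[Lineage]:
    """Weighted consensus of (lineage, support) segment assignments.

    Returns the lineage of the consensus taxon, or None if no taxon
    reaches the required support.
    """
    if min_fraction <= 0.5:
        # Below a strict majority, two contradicting taxa could both qualify.
        raise ValueError("min_fraction must be greater than 0.5")
    total = sum(support for _, support in segments)
    node_support: Dict[Lineage, int] = defaultdict(int)
    for lineage, support in segments:
        # A segment supports its taxon and every ancestor of that taxon.
        for depth in range(1, len(lineage) + 1):
            node_support[lineage[:depth]] += support
    best: Optional[Lineage] = None
    for node, support in node_support.items():
        if support >= min_support and support >= min_fraction * total:
            if best is None or len(node) > len(best):
                best = node
    return best


# Hypothetical segment assignments of one contig with their support values.
ENTERO = ("Bacteria", "Proteobacteria", "Enterobacteriaceae")
EXAMPLE_SEGMENTS = [
    (ENTERO + ("Escherichia",), 300),
    (ENTERO + ("Salmonella",), 100),
    (("Bacteria", "Firmicutes"), 40),
]
```

In the example, neither genus reaches 70% of the summed support on its own, so the contradiction is resolved at the family they share, while the weakly supported minority assignment is outvoted.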

Supplementary Figure S3: 16S gene assignment with taxator-tk

Chart title: taxator-tk on RefSeq 16S genes. Each panel chart shows, per rank (unassigned, superkingdom, phylum, class, order, family, genus, species), the number of assigned sequences (true and false) and the macro-precision (α=0.99) and macro-recall in %. Counts are in sequences; precision, recall and their standard deviations are in %. The two avg/sum rows exclude and include the unassigned rank, respectively.

(a) summary scenario

rank                    depth  true    false  unknown  macro prec.  stdev  pred. bins  macro recall  stdev  real bins
unassigned              0      274.4   0.0    0        100.0        0.0    1           100.0         0.0    1
superkingdom            1      2159.6  0.0    0        100.0        0.0    2           82.7          14.2   2
phylum                  2      869.1   66.6   0        95.5         13.8   13          33.5          23.6   32
class                   3      1417.9  92.6   0        96.1         10.7   25          27.3          18.4   52
order                   4      420.1   69.4   0        95.4         12.6   62          20.7          14.2   109
family                  5      471.6   26.1   0        95.7         13.5   148         16.0          11.9   235
genus                   6      636.7   65.0   0        95.8         16.1   342         9.2           8.9    615
species                 7      623.8   83.0   0        92.6         24.5   570         5.1           6.7    1416
avg/sum (no unassigned) 2.6    6598.8  402.7  0        95.9         13.0   166.0       27.8          14.0   351.6
avg/sum (all)           2.6    6873.3  402.7  0        96.4         11.4   145.4       36.8          12.2   307.8

description             sum true  sum false  overall prec.
root+superkingdom       4593.6    0.0        100.0
phylum+class+order      2707.1    228.6      92.2
family+genus+species    1732.1    174.1      90.9
all but unassigned                           94.2
all with unassigned                          94.5

(b)

Only part of the columns of this panel could be recovered.

rank                    depth  true  false  stdev (macro recall)  real bins
unassigned              0      10    0      0.0                   1
superkingdom            1      80    0      2.4                   2
phylum                  2      113   0      38.3                  32
class                   3      428   0      37.2                  52
order                   4      272   0      39.6                  109
family                  5      750   0      40.4                  235
genus                   6      1779  0      48.0                  615
species                 7      3743  0      46.8                  1416
avg/sum (no unassigned) 5.0    7165  0      36.1                  351.6
avg/sum (all)           5.0    7175  0      31.6                  307.8

description             sum true  sum false  overall prec.
root+superkingdom       170       0          100.0
phylum+class+order      813       0          100.0
family+genus+species    6272      0          100.0
all but unassigned                           100.0
all with unassigned                          100.0

(c) new species scenario

The macro-precision column of this panel could not be recovered.

rank                    depth  true  false  unknown  stdev (macro prec.)  pred. bins  macro recall  stdev  real bins
unassigned              0      22    0      0        0.0                  1           100.0         0.0    1
superkingdom            1      313   0      0        0.0                  2           96.7          3.2    2
phylum                  2      347   2      0        0.0                  14          54.8          39.5   32
class                   3      989   8      0        3.9                  26          53.9          41.6   52
order                   4      948   9      0        7.9                  54          44.2          39.6   109
family                  5      1350  18     0        7.3                  91          28.8          37.5   235
genus                   6      2678  64     0        18.9                 88          10.8          26.3   615
species                 7      0     427    0        0.0                  54          0.0           0.0    1416
avg/sum (no unassigned) 4.6    6625  528    0        5.4                  47.0        41.3          26.8   351.6
avg/sum (all)           4.6    6647  528    0        4.7                  41.3        48.6          23.5   307.8

description             sum true  sum false  overall prec.
root+superkingdom       648       0          100.0
phylum+class+order      2284      19         99.2
family+genus+species    4028      509        88.8
all but unassigned                           92.6
all with unassigned                          92.6

(d) new genus scenario

rank                    depth  true  false  unknown  macro prec.  stdev  pred. bins  macro recall  stdev  real bins
unassigned              0      48    0      0        100.0        0.0    1           100.0         0.0    1
superkingdom            1      804   0      0        100.0        0.0    2           96.5          3.0    2
phylum                  2      1098  2      0        100.0        0.0    12          46.8          38.9   32
class                   3      2392  8      0        98.5         4.7    22          36.3          35.7   52
order                   4      1190  9      0        95.4         17.8   48          25.2          32.9   109
family                  5      1201  36     0        78.8         39.2   59          11.7          26.7   235
genus                   6      0     344    0        0.0          0.0    34          0.0           0.0    615
species                 7      0     43     0        0.0          0.0    8           0.0           0.0    1416
avg/sum (no unassigned) 3.3    6685  442    0        67.5         8.8    26.4        30.9          19.6   351.6
avg/sum (all)           3.3    6733  442    0        71.6         7.7    23.3        39.6          17.2   307.8

description             sum true  sum false  overall prec.
root+superkingdom       1656      0          100.0
phylum+class+order      4680      19         99.6
family+genus+species    1201      423        74.0
all but unassigned                           93.8
all with unassigned                          93.8

100

unassigned superkingdom

stdev

100.0 97.6 78.3 77.6 72.4 71.0 53.9 35.8 69.5 73.3

90

taxator-tk on RefSeq 16S genes

0

macro recall

1 2 16 29 67 158 337 504 159.0 139.3

taxonomic rank

Supplementary Figure S3 - 16S gene assignment with taxator-tk rank

pred. bins

0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

100

taxonomic rank

macro precision α=0.99 100.0 100.0 100.0 98.9 98.8 98.2 95.2 0.0 84.4 86.4

stdev

taxator-tk on RefSeq 16S genes

taxator-tk on RefSeq 16S genes

0

macro unknown (sequences) precision α=0.99 0 100.0 0 100.0 0 100.0 0 100.0 0 100.0 0 100.0 0 100.0 0 100.0 0 100.0 0 100.0

number of assigned sequences

unassigned superkingdom phylum class order family genus species avg/sum avg/sum

depth

100

6,000

90 5,000

80 70

4,000

60 50

3,000

40 2,000

30 20

1,000

10 0

unassigned superkingdom

taxonomic rank

phylum

class

order

taxonomic rank

taxator-tk Supplementary Material Page 15 of 51

family

genus

species

0

number of assigned sequences

rank

(b) all reference scenario

Supplementary Figure S3 - 16S gene assignment with taxator-tk

false (sequences) true (sequences) macro precision α=0.99 macro recall

Supplementary Figure S3 - 16S gene assignment with taxator-tk, panels (e) to (h)

Each panel contains a plot titled "taxator-tk on RefSeq 16S genes" showing, per taxonomic rank, the number of assigned sequences (true and false) and the % macro-precision (α=0.99) and macro-recall. The values underlying each panel are listed below; in every row the columns follow the order given in the "rank" row. A "-" marks a cell that is empty in the table.

(e) new family scenario

rank: unassigned superkingdom phylum class order family genus species avg/sum avg/sum
depth: 0 1 2 3 4 5 6 7 2.4 2.4
true (sequences): 299 1321 1442 3485 531 0 0 0 6779 7078
false (sequences): 0 0 2 11 15 38 17 14 97 97
unknown (sequences): 0 0 0 0 0 0 0 0 0 0
macro precision α=0.99: 100.0 100.0 100.0 97.7 69.6 0.0 0.0 0.0 52.5 58.4
stdev (precision): 0.0 0.0 0.0 7.2 42.6 0.0 0.0 0.0 7.1 6.2
pred. bins: 1 2 7 13 28 24 9 3 12.3 10.9
macro recall: 100.0 82.8 25.4 15.0 3.4 0.0 0.0 0.0 18.1 28.3
stdev (recall): 0.0 13.7 35.2 26.2 12.1 0.0 0.0 0.0 12.5 10.9
real bins: 1 2 32 52 109 235 615 1416 351.6 307.8

description: sum true (sequences) / sum false (sequences) / overall prec.
root+superkingdom: 2941 / 0 / 100.0
phylum+class+order: 5458 / 28 / 99.5
family+genus+species: 0 / 69 / 0.0
overall prec.: 98.6 (all but unassigned), 98.6 (all with unassigned)

(f) new order scenario

rank: unassigned superkingdom phylum class order family genus species avg/sum avg/sum
depth: 0 1 2 3 4 5 6 7 2.1 2.1
true (sequences): 424 1920 1665 2631 0 0 0 0 6216 6640
false (sequences): 0 0 2 12 434 74 2 11 535 535
unknown (sequences): 0 0 0 0 0 0 0 0 0 0
macro precision α=0.99: 100.0 100.0 100.0 99.6 0.0 0.0 0.0 0.0 42.8 49.9
stdev (precision): 0.0 0.0 0.1 1.1 0.0 0.0 0.0 0.0 0.2 0.1
pred. bins: 1 2 6 8 17 11 3 1 6.9 6.1
macro recall: 100.0 72.9 19.1 7.9 0.0 0.0 0.0 0.0 14.3 25.0
stdev (recall): 0.0 22.3 31.6 20.8 0.0 0.0 0.0 0.0 10.7 9.3
real bins: 1 2 32 52 109 235 615 1416 351.6 307.8

description: sum true (sequences) / sum false (sequences) / overall prec.
root+superkingdom: 4264 / 0 / 100.0
phylum+class+order: 4296 / 448 / 90.6
family+genus+species: 0 / 87 / 0.0
overall prec.: 92.1 (all but unassigned), 92.5 (all with unassigned)

(g) new class scenario

rank: unassigned superkingdom phylum class order family genus species avg/sum avg/sum
depth: 0 1 2 3 4 5 6 7 1.2 1.2
true (sequences): 549 4734 1419 0 0 0 0 0 6153 6702
false (sequences): 0 0 67 390 3 9 1 3 473 473
unknown (sequences): 0 0 0 0 0 0 0 0 0 0
macro precision α=0.99: 100.0 100.0 90.3 0.0 0.0 0.0 0.0 0.0 27.2 36.3
stdev (precision): 0.0 0.0 15.4 0.0 0.0 0.0 0.0 0.0 2.2 1.9
pred. bins: 1 2 4 8 8 6 2 1 4.4 4.0
macro recall: 100.0 73.7 9.8 0.0 0.0 0.0 0.0 0.0 11.9 22.9
stdev (recall): 0.0 19.6 23.9 0.0 0.0 0.0 0.0 0.0 6.2 5.4
real bins: 1 2 32 52 109 235 615 1416 351.6 307.8

description: sum true (sequences) / sum false (sequences) / overall prec.
root+superkingdom: 10017 / 0 / 100.0
phylum+class+order: 1419 / 460 / 75.5
family+genus+species: 0 / 13 / 0.0
overall prec.: 92.9 (all but unassigned), 93.4 (all with unassigned)

(h) new phylum scenario

rank: unassigned superkingdom phylum class order family genus species avg/sum avg/sum
depth: 0 1 2 3 4 5 6 7 1.1 1.1
true (sequences): 569 5945 0 0 0 0 0 - 5945 6514
false (sequences): 0 0 391 219 16 8 27 - 661 661
unknown (sequences): 0 0 0 0 0 0 0 - 0 0
macro precision α=0.99: 100.0 100.0 0.0 0.0 0.0 0.0 0.0 nan 16.7 28.6
stdev (precision): 0.0 0.0 0.0 0.0 0.0 0.0 0.0 nan 0.0 0.0
pred. bins: 1 1 5 8 7 5 2 0 4.0 3.6
macro recall: 100.0 58.5 0.0 0.0 0.0 0.0 0.0 0.0 8.4 19.8
stdev (recall): 0.0 35.3 0.0 0.0 0.0 0.0 0.0 0.0 5.0 4.4
real bins: 1 2 32 52 109 235 615 1416 351.6 307.8

description: sum true (sequences) / sum false (sequences) / overall prec.
root+superkingdom: 12459 / 0 / 100.0
phylum+class+order: 0 / 626 / 0.0
family+genus+species: 0 / 35 / 0.0
overall prec.: 90.0 (all but unassigned), 90.8 (all with unassigned)
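The "overall prec." entries and the two "avg/sum" columns of the Supplementary Figure S3 tables can be checked by simple arithmetic. The following is a minimal sketch, assuming that overall precision is the percentage of true assignments among all true and false assignments, and that the two averages are taken over the ranks without and with "unassigned", skipping undefined (nan) entries. The function names are hypothetical and are not part of taxator-tk; the numbers are copied from panel (a), the summary scenario.

```python
import math


def overall_precision(true_count, false_count):
    """Return 100 * true / (true + false), or NaN when nothing was assigned."""
    total = true_count + false_count
    if total == 0:
        return math.nan
    return 100.0 * true_count / total


def mean_ignoring_nan(values):
    """Average the defined entries only (assumption: NaN entries are skipped)."""
    kept = [v for v in values if not math.isnan(v)]
    return sum(kept) / len(kept) if kept else math.nan


# Values copied from panel (a), summary scenario, of Supplementary Figure S3.
grouped = {
    "phylum+class+order": (2707.1, 228.6),
    "family+genus+species": (1732.1, 174.1),
    "all but unassigned": (6598.8, 402.7),
    "all with unassigned": (6873.3, 402.7),
}
for label, (n_true, n_false) in grouped.items():
    print(label, round(overall_precision(n_true, n_false), 1))

# Per-rank macro precision, ordered from "unassigned" to "species".
macro_precision = [100.0, 100.0, 95.5, 96.1, 95.4, 95.7, 95.8, 92.6]
print(round(mean_ignoring_nan(macro_precision[1:]), 1))  # without "unassigned"
print(round(mean_ignoring_nan(macro_precision), 1))      # with "unassigned"
```

Under these assumptions the printed values agree with the tabulated 92.2, 90.9, 94.2 and 94.5 for overall precision and with 95.9 and 96.4 for the averaged macro precision.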

Supplementary Figure S4: Taxonomic composition of microbial RefSeq47

Taxonomic composition, down to family level, of the microbial (bacteria, archaea and viruses) portion of the RefSeq47 sequence data collection, visualized with Krona (Ondov et al., 2011). An interactive version can be found in the supplementary files (RefSeq47.krona.html). Abundance is measured as the accumulated sequence length per clade.
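As an illustration of the abundance measure named in the caption, the sketch below adds each sequence's length to every clade on its lineage. It is a toy example under stated assumptions: the function `accumulate_lengths`, the record layout and the lengths are hypothetical, and the code does not reproduce Krona's input format or the actual RefSeq47 processing.

```python
from collections import defaultdict


def accumulate_lengths(records):
    """Sum sequence lengths per clade.

    records: iterable of (lineage, length) pairs, where lineage is a
    root-to-leaf tuple of clade names (hypothetical layout).
    """
    totals = defaultdict(int)
    for lineage, length in records:
        # A sequence contributes its length to every ancestor clade.
        for depth in range(1, len(lineage) + 1):
            totals[lineage[:depth]] += length
    return dict(totals)


# Invented toy records, not taken from RefSeq47.
toy_records = [
    (("Bacteria", "Proteobacteria"), 5000),
    (("Bacteria", "Firmicutes"), 3000),
    (("Archaea", "Euryarchaeota"), 2000),
]
totals = accumulate_lengths(toy_records)
for clade, length in sorted(totals.items()):
    print("/".join(clade), length)
```

Keying the totals by the full lineage prefix rather than by clade name alone keeps clades with identical names in different parts of the taxonomy separate.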


Supplementary Figure S5: Taxonomic composition of 16S genes extracted from RefSeq47

Taxonomic composition, down to genus level, of the 16S benchmark dataset, visualized with Krona (Ondov et al., 2011). The dataset was created by extracting every annotated 16S gene in RefSeq47 that was at least 1000 bp long. An interactive version can be found in the supplementary files (refseq16S.krona.html). Abundance is measured as the number of 16S genes.
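The selection rule in the caption (keep annotated 16S genes of at least 1000 bp, then count genes per clade) can be sketched as follows. This is an illustrative sketch only: the record layout, the function names and the toy sequences are assumptions and do not describe how the benchmark dataset was actually produced.

```python
from collections import Counter

MIN_LENGTH = 1000  # bp, the threshold stated in the caption


def select_16s_genes(genes, min_length=MIN_LENGTH):
    """Keep gene records whose sequence has at least min_length bases."""
    return [gene for gene in genes if len(gene["sequence"]) >= min_length]


def count_per_genus(genes):
    """Abundance as the number of 16S genes per genus."""
    return Counter(gene["genus"] for gene in genes)


# Invented toy records (hypothetical layout), not taken from RefSeq47.
toy_genes = [
    {"genus": "Escherichia", "sequence": "A" * 1500},
    {"genus": "Escherichia", "sequence": "C" * 1000},
    {"genus": "Bacillus", "sequence": "G" * 999},
    {"genus": "Bacillus", "sequence": "T" * 1540},
]
selected = select_16s_genes(toy_genes)
counts = count_per_genus(selected)
print(counts)
```

The comparison uses `>=` so that a gene of exactly 1000 bp is retained, matching the "at least 1000 bp" wording.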


Supplementary Figure S6 - Simulated 100 bp sequence assignment with taxator-tk, panels (a) to (d)

Each panel contains a plot titled "taxator-tk on simulated 100bp sequences" showing, per taxonomic rank, the number of assigned sequences (true and false) and the % macro-precision (α=0.99) and macro-recall. The values underlying each panel are listed below; in every row the columns follow the order given in the "rank" row.

(a) summary scenario

rank: unassigned superkingdom phylum class order family genus species avg/sum avg/sum
depth: 0 1 2 3 4 5 6 7 2.4 1.5
true (sequences): 37391.6 32272.9 4563.7 2164.1 2249.9 2859.3 7852.3 2808.6 54770.7 92162.3
false (sequences): 0.0 427.3 2340.3 1120.1 1535.0 591.9 1275.7 547.4 7837.7 7837.7
unknown (sequences): 0 0 0 0 0 0 0 0 0 0
macro precision α=0.99: 100.0 99.2 83.0 82.0 86.5 85.1 87.3 74.0 85.3 87.1
stdev (precision): 0.0 0.0 10.2 13.1 11.1 14.7 17.6 34.8 14.5 12.7
pred. bins: 1 1 11 23 52 98 202 431 116.9 102.4
macro recall: 100.0 26.4 9.3 8.9 7.8 5.8 3.5 1.0 8.9 20.3
stdev (recall): 0.0 26.7 8.5 7.6 7.2 6.8 5.6 2.6 9.3 8.1
real bins: 1 3 32 52 110 240 656 1697 398.6 348.9

description: sum true (sequences) / sum false (sequences) / overall prec.
root+superkingdom: 101937.3 / 427.3 / 99.6
phylum+class+order: 8977.7 / 4995.4 / 64.2
family+genus+species: 13520.1 / 2415.0 / 84.8
overall prec.: 87.5 (all but unassigned), 92.2 (all with unassigned)

(b) all reference scenario

rank: unassigned superkingdom phylum class order family genus species avg/sum avg/sum
depth: 0 1 2 3 4 5 6 7 4.1 3.5
true (sequences): 10662 18479 4362 2607 4629 8015 31586 19660 89338 100000
false (sequences): 0 0 0 0 0 0 0 0 0 0
unknown (sequences): 0 0 0 0 0 0 0 0 0 0
macro precision α=0.99: 100.0 100.0 100.0 100.0 100.0 100.0 100.0 100.0 100.0 100.0
stdev (precision): 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
pred. bins: 1 2 12 24 54 104 211 365 110.3 96.6
macro recall: 100.0 48.0 35.2 35.7 33.9 27.8 19.2 6.9 29.5 38.4
stdev (recall): 0.0 37.0 28.1 27.1 28.2 29.2 28.2 18.2 28.0 24.5
real bins: 1 3 32 52 110 240 656 1697 398.6 348.9

description: sum true (sequences) / sum false (sequences) / overall prec.
root+superkingdom: 47620 / 0 / 100.0
phylum+class+order: 11598 / 0 / 100.0
family+genus+species: 59261 / 0 / 100.0
overall prec.: 100.0 (all but unassigned), 100.0 (all with unassigned)

(c) new species scenario

rank: unassigned superkingdom phylum class order family genus species avg/sum avg/sum
depth: 0 1 2 3 4 5 6 7 3.4 2.6
true (sequences): 22319 27291 5362 3240 4611 7973 23380 0 71857 94176
false (sequences): 0 252 746 327 468 255 776 3000 5824 5824
unknown (sequences): 0 0 0 0 0 0 0 0 0 0
macro precision α=0.99: 100.0 99.6 97.2 97.2 97.3 96.1 90.4 0.0 82.5 84.7
stdev (precision): 0.0 0.0 1.8 2.9 3.3 6.7 21.6 0.0 5.2 4.5
pred. bins: 1 1 10 22 45 75 100 217 67.1 58.9
macro recall: 100.0 35.1 17.5 18.1 15.4 10.9 5.0 0.0 14.6 25.3
stdev (recall): 0.0 32.5 18.3 19.1 18.5 18.3 14.2 0.0 17.3 15.1
real bins: 1 3 32 52 110 240 656 1697 398.6 348.9

description: sum true (sequences) / sum false (sequences) / overall prec.
root+superkingdom: 76901 / 252 / 99.7
phylum+class+order: 13213 / 1541 / 89.6
family+genus+species: 31353 / 4031 / 88.6
overall prec.: 92.5 (all but unassigned), 94.2 (all with unassigned)

(d) new genus scenario

rank: unassigned superkingdom phylum class order family genus species avg/sum avg/sum
depth: 0 1 2 3 4 5 6 7 2.1 1.4
true (sequences): 34909 37775 6814 3906 3846 4027 0 0 56368 91277
false (sequences): 0 343 1406 689 962 657 4422 244 8723 8723
unknown (sequences): 0 0 0 0 0 0 0 0 0 0
macro precision α=0.99: 100.0 99.3 82.6 82.0 80.6 49.4 0.0 0.0 56.3 61.7
stdev (precision): 0.0 0.0 22.0 17.8 17.2 39.3 0.0 0.0 13.8 12.0
pred. bins: 1 1 8 19 44 77 193 103 63.6 55.8
macro recall: 100.0 27.2 6.9 6.1 4.4 1.7 0.0 0.0 6.6 18.3
stdev (recall): 0.0 28.0 9.4 7.9 7.5 5.3 0.0 0.0 8.3 7.3
real bins: 1 3 32 52 110 240 656 1697 398.6 348.9

description: sum true (sequences) / sum false (sequences) / overall prec.
root+superkingdom: 110459 / 343 / 99.7
phylum+class+order: 14566 / 3057 / 82.7
family+genus+species: 4027 / 5323 / 43.1
overall prec.: 86.6 (all but unassigned), 91.3 (all with unassigned)

Supplementary Figure S6 - Simulated 100 bp sequence assignment with taxator-tk, panels (e) to (h)

Each panel contains a plot titled "taxator-tk on simulated 100bp sequences" showing, per taxonomic rank, the number of assigned sequences (true and false) and the % macro-precision (α=0.99) and macro-recall. The values underlying each panel are listed below; in every row the columns follow the order given in the "rank" row.

(e) new family scenario

rank: unassigned superkingdom phylum class order family genus species avg/sum avg/sum
depth: 0 1 2 3 4 5 6 7 1.7 1.0
true (sequences): 40215 40076 6632 3425 2663 0 0 0 52796 93011
false (sequences): 0 525 1627 904 1309 1045 1413 166 6989 6989
unknown (sequences): 0 0 0 0 0 0 0 0 0 0
macro precision α=0.99: 100.0 98.9 80.7 59.6 31.9 0.0 0.0 0.0 38.7 46.4
stdev (precision): 0.0 0.0 7.2 27.6 33.3 0.0 0.0 0.0 9.7 8.5
pred. bins: 1 1 6 14 43 120 133 80 56.7 49.8
macro recall: 100.0 22.2 2.9 2.2 0.9 0.0 0.0 0.0 4.0 16.0
stdev (recall): 0.0 27.0 5.6 3.9 2.5 0.0 0.0 0.0 5.6 4.9
real bins: 1 3 32 52 110 240 656 1697 398.6 348.9

description: sum true (sequences) / sum false (sequences) / overall prec.
root+superkingdom: 120367 / 525 / 99.6
phylum+class+order: 12720 / 3840 / 76.8
family+genus+species: 0 / 2624 / 0.0
overall prec.: 88.3 (all but unassigned), 93.0 (all with unassigned)

(f) new order scenario

rank: unassigned superkingdom phylum class order family genus species avg/sum avg/sum
depth: 0 1 2 3 4 5 6 7 1.5 0.9
true (sequences): 44037 38701 5817 1971 0 0 0 0 46489 90526
false (sequences): 0 563 3314 1414 2189 961 873 160 9474 9474
unknown (sequences): 0 0 0 0 0 0 0 0 0 0
macro precision α=0.99: 100.0 98.8 60.2 23.3 0.0 0.0 0.0 0.0 26.0 35.3
stdev (precision): 0.0 0.0 21.5 23.8 0.0 0.0 0.0 0.0 6.5 5.7
pred. bins: 1 1 6 18 49 106 118 72 52.9 46.4
macro recall: 100.0 20.3 1.8 0.6 0.0 0.0 0.0 0.0 3.2 15.3
stdev (recall): 0.0 25.5 4.0 1.5 0.0 0.0 0.0 0.0 4.4 3.9
real bins: 1 3 32 52 110 240 656 1697 398.6 348.9

description: sum true (sequences) / sum false (sequences) / overall prec.
root+superkingdom: 121439 / 563 / 99.5
phylum+class+order: 7788 / 6917 / 53.0
family+genus+species: 0 / 1994 / 0.0
overall prec.: 83.1 (all but unassigned), 90.5 (all with unassigned)

(g) new class scenario

rank: unassigned superkingdom phylum class order family genus species avg/sum avg/sum
depth: 0 1 2 3 4 5 6 7 1.5 0.7
true (sequences): 50150 36288 2959 0 0 0 0 0 39247 89397
false (sequences): 0 579 4203 2365 1894 692 742 128 10603 10603
unknown (sequences): 0 0 0 0 0 0 0 0 0 0
macro precision α=0.99: 100.0 98.6 27.7 0.0 0.0 0.0 0.0 0.0 18.0 28.3
stdev (precision): 0.0 0.0 24.4 0.0 0.0 0.0 0.0 0.0 3.5 3.0
pred. bins: 1 1 7 20 49 100 109 63 49.9 43.8
macro recall: 100.0 17.9 0.7 0.0 0.0 0.0 0.0 0.0 2.7 14.8
stdev (recall): 0.0 22.8 2.0 0.0 0.0 0.0 0.0 0.0 3.5 3.1
real bins: 1 3 32 52 110 240 656 1697 398.6 348.9

description: sum true (sequences) / sum false (sequences) / overall prec.
root+superkingdom: 122726 / 579 / 99.5
phylum+class+order: 2959 / 8462 / 25.9
family+genus+species: 0 / 1562 / 0.0
overall prec.: 78.7 (all but unassigned), 89.4 (all with unassigned)

(h) new phylum scenario

rank: unassigned superkingdom phylum class order family genus species avg/sum avg/sum
depth: 0 1 2 3 4 5 6 7 1.7 0.7
true (sequences): 59449 27300 0 0 0 0 0 0 27300 86749
false (sequences): 0 729 5086 2142 3923 533 704 134 13251 13251
unknown (sequences): 0 0 0 0 0 0 0 0 0 0
macro precision α=0.99: 100.0 97.9 0.0 0.0 0.0 0.0 0.0 0.0 14.0 24.7
stdev (precision): 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
pred. bins: 1 1 11 20 44 98 105 60 48.4 42.5
macro recall: 100.0 14.0 0.0 0.0 0.0 0.0 0.0 0.0 2.0 14.3
stdev (recall): 0.0 18.6 0.0 0.0 0.0 0.0 0.0 0.0 2.7 2.3
real bins: 1 3 32 52 110 240 656 1697 398.6 348.9

description: sum true (sequences) / sum false (sequences) / overall prec.
root+superkingdom: 114049 / 729 / 99.4
phylum+class+order: 0 / 11151 / 0.0
family+genus+species: 0 / 1371 / 0.0
overall prec.: 67.3 (all but unassigned), 86.7 (all with unassigned)

Supplementary Figure S7 - Simulated 500 bp sequence assignment with taxator-tk, panels (a) to (d)

Each panel contains a plot titled "taxator-tk on simulated 500bp sequences" showing, per taxonomic rank, the number of assigned sequences (true and false) and the % macro-precision (α=0.99) and macro-recall. The values underlying each panel are listed below; in every row the columns follow the order given in the "rank" row.

(a) summary scenario

rank: unassigned superkingdom phylum class order family genus species avg/sum avg/sum
depth: 0 1 2 3 4 5 6 7 2.3 1.8
true (sequences): 20001.1 39599.4 7862.7 3756.1 3446.6 3162.4 7880.9 3622.7 69330.9 89332.0
false (sequences): 0.0 582.0 3532.7 1555.3 2138.3 702.6 1428.4 728.7 10668.0 10668.0
unknown (sequences): 0 0 0 0 0 0 0 0 0 0
macro precision α=0.99: 100.0 99.1 84.1 81.8 85.1 84.6 87.6 76.5 85.6 87.4
stdev (precision): 0.0 0.0 12.4 14.8 13.1 17.2 19.3 34.0 15.8 13.8
pred. bins: 1 1 12 24 56 104 212 480 127.0 111.3
macro recall: 100.0 53.6 13.2 12.1 10.2 7.1 4.2 1.4 14.5 25.2
stdev (recall): 0.0 26.8 11.4 9.4 8.6 7.8 6.3 3.4 10.5 9.2
real bins: 1 2 32 52 110 240 656 1693 397.9 348.3

description: sum true (sequences) / sum false (sequences) / overall prec.
root+superkingdom: 99200.0 / 582.0 / 99.4
phylum+class+order: 15065.4 / 7226.3 / 67.6
family+genus+species: 14666.0 / 2859.7 / 83.7
overall prec.: 86.7 (all but unassigned), 89.3 (all with unassigned)

(b) all reference scenario

rank: unassigned superkingdom phylum class order family genus species avg/sum avg/sum
depth: 0 1 2 3 4 5 6 7 4.0 3.6
true (sequences): 7999 17442 4415 2699 4636 7889 29561 25359 92001 100000
false (sequences): 0 0 0 0 0 0 0 0 0 0
unknown (sequences): 0 0 0 0 0 0 0 0 0 0
macro precision α=0.99: 100.0 100.0 100.0 100.0 100.0 100.0 100.0 100.0 100.0 100.0
stdev (precision): 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
pred. bins: 1 2 14 27 59 109 221 408 120.0 105.1
macro recall: 100.0 81.8 43.0 43.7 40.6 32.6 23.0 9.5 39.2 46.8
stdev (recall): 0.0 10.7 30.7 29.3 30.9 31.8 32.0 23.5 27.0 23.6
real bins: 1 2 32 52 110 240 656 1693 397.9 348.3

description: sum true (sequences) / sum false (sequences) / overall prec.
root+superkingdom: 42883 / 0 / 100.0
phylum+class+order: 11750 / 0 / 100.0
family+genus+species: 62809 / 0 / 100.0
overall prec.: 100.0 (all but unassigned), 100.0 (all with unassigned)

(c) new species scenario

rank: unassigned superkingdom phylum class order family genus species avg/sum avg/sum
depth: 0 1 2 3 4 5 6 7 3.5 3.1
true (sequences): 10520 28378 8027 4991 6476 9009 25605 0 82486 93006
false (sequences): 0 224 773 337 481 253 910 4016 6994 6994
unknown (sequences): 0 0 0 0 0 0 0 0 0 0
macro precision α=0.99: 100.0 99.2 98.4 97.8 97.7 96.6 90.8 0.0 82.9 85.1
stdev (precision): 0.0 0.5 1.2 2.6 2.8 6.6 21.1 0.0 5.0 4.3
pred. bins: 1 2 11 23 50 79 107 237 72.7 63.8
macro recall: 100.0 68.6 24.7 24.9 21.0 14.2 6.3 0.0 22.8 32.5
stdev (recall): 0.0 21.6 23.6 23.3 22.3 21.7 16.4 0.0 18.4 16.1
real bins: 1 2 32 52 110 240 656 1693 397.9 348.3

description: sum true (sequences) / sum false (sequences) / overall prec.
root+superkingdom: 67276 / 224 / 99.7
phylum+class+order: 19494 / 1591 / 92.5
family+genus+species: 34614 / 5179 / 87.0
overall prec.: 92.2 (all but unassigned), 93.0 (all with unassigned)

(d) new genus scenario

rank: unassigned superkingdom phylum class order family genus species avg/sum avg/sum
depth: 0 1 2 3 4 5 6 7 2.2 1.9
true (sequences): 13884 43450 12278 7723 7610 5239 0 0 76300 90184
false (sequences): 0 357 1548 760 1032 761 5064 294 9816 9816
unknown (sequences): 0 0 0 0 0 0 0 0 0 0
macro precision α=0.99: 100.0 99.5 95.9 87.2 86.3 52.7 0.0 0.0 60.2 65.2
stdev (precision): 0.0 0.0 2.9 13.6 12.4 40.4 0.0 0.0 9.9 8.7
pred. bins: 1 1 9 21 47 81 136 105 57.1 50.1
macro recall: 100.0 60.2 13.4 10.9 7.8 2.8 0.0 0.0 13.6 24.4
stdev (recall): 0.0 26.6 15.2 12.4 11.2 7.9 0.0 0.0 10.5 9.2
real bins: 1 2 32 52 110 240 656 1693 397.9 348.3

description: sum true (sequences) / sum false (sequences) / overall prec.
root+superkingdom: 100784 / 357 / 99.6
phylum+class+order: 27611 / 3340 / 89.2
family+genus+species: 5239 / 6119 / 46.1
overall prec.: 88.6 (all but unassigned), 90.2 (all with unassigned)

Supplementary Figure S7 - Simulated 500 bp sequence assignment with taxator-tk, panels (e) to (h)

Each panel contains a plot titled "taxator-tk on simulated 500bp sequences" showing, per taxonomic rank, the number of assigned sequences (true and false) and the % macro-precision (α=0.99) and macro-recall. The values underlying each panel are listed below; in every row the columns follow the order given in the "rank" row.

(e) new family scenario

rank: unassigned superkingdom phylum class order family genus species avg/sum avg/sum
depth: 0 1 2 3 4 5 6 7 1.8 1.5
true (sequences): 18731 47733 12581 7177 5404 0 0 0 72895 91626
false (sequences): 0 702 2006 1099 1570 1251 1492 254 8374 8374
unknown (sequences): 0 0 0 0 0 0 0 0 0 0
macro precision α=0.99: 100.0 98.9 88.9 64.4 40.0 0.0 0.0 0.0 41.8 49.0
stdev (precision): 0.0 0.0 4.2 31.5 36.6 0.0 0.0 0.0 10.3 9.0
pred. bins: 1 1 6 16 42 103 122 82 53.1 46.6
macro recall: 100.0 47.8 5.9 4.0 1.6 0.0 0.0 0.0 8.5 19.9
stdev (recall): 0.0 34.0 10.5 6.7 4.2 0.0 0.0 0.0 7.9 6.9
real bins: 1 2 32 52 110 240 656 1693 397.9 348.3

description: sum true (sequences) / sum false (sequences) / overall prec.
root+superkingdom: 114197 / 702 / 99.4
phylum+class+order: 25162 / 4675 / 84.3
family+genus+species: 0 / 2997 / 0.0
overall prec.: 89.7 (all but unassigned), 91.6 (all with unassigned)

(f) new order scenario

rank: unassigned superkingdom phylum class order family genus species avg/sum avg/sum
depth: 0 1 2 3 4 5 6 7 1.6 1.2
true (sequences): 21881 49724 11475 3703 0 0 0 0 64902 86783
false (sequences): 0 771 4991 1888 3483 1003 868 213 13217 13217
unknown (sequences): 0 0 0 0 0 0 0 0 0 0
macro precision α=0.99: 100.0 98.8 71.7 31.9 0.0 0.0 0.0 0.0 28.9 37.8
stdev (precision): 0.0 0.0 17.0 29.3 0.0 0.0 0.0 0.0 6.6 5.8
pred. bins: 1 1 6 18 56 106 110 63 51.4 45.1
macro recall: 100.0 44.4 3.6 1.1 0.0 0.0 0.0 0.0 7.0 18.6
stdev (recall): 0.0 34.2 7.8 2.5 0.0 0.0 0.0 0.0 6.4 5.6
real bins: 1 2 32 52 110 240 656 1693 397.9 348.3

description: sum true (sequences) / sum false (sequences) / overall prec.
root+superkingdom: 121329 / 771 / 99.4
phylum+class+order: 15178 / 10362 / 59.4
family+genus+species: 0 / 2084 / 0.0
overall prec.: 83.1 (all but unassigned), 86.8 (all with unassigned)

(g) new class scenario

rank: unassigned superkingdom phylum class order family genus species avg/sum avg/sum
depth: 0 1 2 3 4 5 6 7 1.5 1.1
true (sequences): 28770 49324 6263 0 0 0 0 0 55587 84357
false (sequences): 0 838 6679 3676 2612 852 834 152 15643 15643
unknown (sequences): 0 0 0 0 0 0 0 0 0 0
macro precision α=0.99: 100.0 98.5 34.0 0.0 0.0 0.0 0.0 0.0 18.9 29.1
stdev (precision): 0.0 0.0 28.9 0.0 0.0 0.0 0.0 0.0 4.1 3.6
pred. bins: 1 1 7 21 52 108 103 56 49.7 43.6
macro recall: 100.0 39.9 1.5 0.0 0.0 0.0 0.0 0.0 5.9 17.7
stdev (recall): 0.0 31.6 3.8 0.0 0.0 0.0 0.0 0.0 5.1 4.4
real bins: 1 2 32 52 110 240 656 1693 397.9 348.3

description: sum true (sequences) / sum false (sequences) / overall prec.
root+superkingdom: 127418 / 838 / 99.3
phylum+class+order: 6263 / 12967 / 32.6
family+genus+species: 0 / 1838 / 0.0
overall prec.: 78.0 (all but unassigned), 84.4 (all with unassigned)

(h) new phylum scenario

rank: unassigned superkingdom phylum class order family genus species avg/sum avg/sum
depth: 0 1 2 3 4 5 6 7 1.6 1.0
true (sequences): 38223 41145 0 0 0 0 0 0 41145 79368
false (sequences): 0 1182 8732 3127 5790 798 831 172 20632 20632
unknown (sequences): 0 0 0 0 0 0 0 0 0 0
macro precision α=0.99: 100.0 97.7 0.0 0.0 0.0 0.0 0.0 0.0 14.0 24.7
stdev (precision): 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
pred. bins: 1 1 11 22 44 106 93 45 46.0 40.4
macro recall: 100.0 32.6 0.0 0.0 0.0 0.0 0.0 0.0 4.7 16.6
stdev (recall): 0.0 28.9 0.0 0.0 0.0 0.0 0.0 0.0 4.1 3.6
real bins: 1 2 32 52 110 240 656 1693 397.9 348.3

description: sum true (sequences) / sum false (sequences) / overall prec.
root+superkingdom: 120513 / 1182 / 99.0
phylum+class+order: 0 / 17649 / 0.0
family+genus+species: 0 / 1801 / 0.0
overall prec.: 66.6 (all but unassigned), 79.4 (all with unassigned)

Supplementary Figure S8 - Simulated 1000 bp sequence assignment with taxator-tk

(a) summary scenario

rank: unassigned superkingdom phylum class order family genus species avg/sum avg/sum
depth: 0 1 2 3 4 5 6 7 2.4 1.9
true (sequences): 18217.3 37796.3 9465.1 4795.0 4417.0 3498.1 7834.3 3837.4 71643.3 89860.6
false (sequences): 0.0 550.7 3300.7 1367.1 1966.7 817.1 1397.6 739.4 10139.4 10139.4
unknown (sequences): 0 0 0 0 0 0 0 0 0 0
macro precision α=0.99: 100.0 99.2 87.0 83.2 84.5 84.4 86.4 77.2 86.0 87.7
stdev (precision): 0.0 0.0 12.2 14.7 15.1 17.9 19.6 34.2 16.2 14.2
pred. bins: 1 1 12 25 57 106 219 472 127.4 111.6
macro recall: 100.0 38.1 15.2 13.4 10.8 7.5 4.3 1.5 13.0 23.8
stdev (recall): 0.0 33.8 12.7 10.3 9.2 8.1 6.3 3.5 12.0 10.5
real bins: 1 3 32 52 110 240 653 1690 397.1 347.6

description: sum true (sequences) / sum false (sequences) / overall prec.
root+superkingdom: 93809.9 / 550.7 / 99.4
phylum+class+order: 18677.1 / 6634.6 / 73.8
family+genus+species: 15169.9 / 2954.1 / 83.7
overall prec.: 87.6 (all but unassigned), 89.9 (all with unassigned)

Supplementary Figure S8, panels (b) to (d): each panel plots, per taxonomic rank, the number of assigned sequences (true and false) and the % macro-precision (α=0.99) and macro-recall. In every row the columns follow the order given in the "rank" row.

(b) all reference scenario

rank: unassigned superkingdom phylum class order family genus species avg/sum avg/sum
depth: 0 1 2 3 4 5 6 7 4.0 3.6
true (sequences): 7256 16867 4739 3024 4832 8132 28288 26862 92744 100000
false (sequences): 0 0 0 0 0 0 0 0 0 0
pred. bins: 1 2 14 27 59 112 221 402 119.6 104.8
macro recall: 100.0 55.1 45.2 44.7 40.6 33.1 23.3 10.1 36.0 44.0
stdev (recall): 0.0 39.9 30.5 29.7 31.1 32.0 32.2 24.8 31.5 27.5
real bins: 1 3 32 52 110 240 653 1690 397.1 347.6

description: sum true (sequences) / sum false (sequences) / overall prec.
root+superkingdom: 40990 / 0 / 100.0
phylum+class+order: 12595 / 0 / 100.0
family+genus+species: 63282 / 0 / 100.0
overall prec.: 100.0 (all but unassigned), 100.0 (all with unassigned)

(c) new species scenario

rank: unassigned superkingdom phylum class order family genus species avg/sum avg/sum
depth: 0 1 2 3 4 5 6 7 3.5 3.3
true (sequences): 7557 26333 9128 6028 7878 9803 26552 0 85722 93279
false (sequences): 0 191 523 250 369 325 898 4165 6721 6721
unknown (sequences): 0 0 0 0 0 0 0 0 0 0
stdev (precision): 0.0 0.2 0.6 2.4 4.4 9.0 22.4 0.0 5.6 4.9
pred. bins: 1 2 12 24 52 83 107 230 72.9 63.9
macro recall: 100.0 49.7 28.6 28.3 23.0 15.4 6.5 0.0 21.6 31.4
stdev (recall): 0.0 38.2 25.3 25.2 23.6 22.6 16.7 0.0 21.7 19.0
real bins: 1 3 32 52 110 240 653 1690 397.1 347.6

description: sum true (sequences) / sum false (sequences) / overall prec.
root+superkingdom: 60223 / 191 / 99.7
phylum+class+order: 23034 / 1142 / 95.3
family+genus+species: 36355 / 5388 / 87.1
overall prec.: 92.7 (all but unassigned), 93.3 (all with unassigned)

(d) new genus scenario

rank: unassigned superkingdom phylum class order family genus species avg/sum avg/sum
depth: 0 1 2 3 4 5 6 7 2.4 2.2
true (sequences): 9974 39636 14673 9782 10445 6552 0 0 81088 91062
false (sequences): 0 293 1006 538 874 955 4978 294 8938 8938
unknown (sequences): 0 0 0 0 0 0 0 0 0 0
macro precision α=0.99: 100.0 98.8 94.6 90.4 90.2 59.0 0.0 0.0 61.9 66.6

stdev

pred. bins

macro recall

stdev

real bins

0.0 0.8 9.8 11.6 10.8 39.9 0.0 0.0 10.4 9.1

1 2 10 22 47 82 143 94 57.1 50.1

100.0 44.0 17.7 14.2 9.7 3.7 0.0 0.0 12.8 23.7

0.0 37.1 18.6 15.4 13.5 9.4 0.0 0.0 13.4 11.7

1 3 32 52 110 240 653 1690 397.1 347.6

sum true (sequences)

sum false (sequences)

overall prec.

description

89246

293

99.7

root+superkingdom

34900

2418

93.5

phylum+class+order

6552

6227

51.3

family+genus+species

90.1 91.1

all but unassigned all with unassigned

taxator-tk on simulated 1000bp sequences 50,000

phylum

0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

sum true (sequences)

50,000

taxator-tk on simulated 1000bp sequences

unassigned superkingdom

real bins

80

0

(c) new species scenario

100

0

stdev

taxonomic rank

Supplementary Figure S8 - Simulated 1000 bp sequence assignment with taxator-tk

unassigned superkingdom phylum class order family genus species avg/sum avg/sum

macro recall

90

taxonomic rank

macro precision α=0.99 100.0 99.5 99.2 98.4 97.8 96.3 90.5 0.0 83.1 85.2

pred. bins

100

0

species

stdev

taxator-tk on simulated 1000bp sequences

taxator-tk on simulated 1000bp sequences 100

macro unknown (sequences) precision α=0.99 0 100.0 0 100.0 0 100.0 0 100.0 0 100.0 0 100.0 0 100.0 0 100.0 0 100.0 0 100.0

(b) all reference scenario

number of assigned sequences

unassigned superkingdom phylum class order family genus species avg/sum avg/sum

depth

Supplementary Figure S8 - Simulated 1000 bp sequence assignment with taxator-tk

100

50,000

90 80

40,000

70 60

30,000

50 40

20,000

30 20

10,000

10 0

unassigned superkingdom

taxonomic rank

phylum

class

order

taxonomic rank

taxator-tk Supplementary Material Page 23 of 51

family

genus

species

0

number of assigned sequences

rank

(a) summary scenario

false (sequences) true (sequences) macro precision α=0.99 macro recall

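Supplementary Figure S8 - Simulated 1000 bp sequence assignment with taxator-tk: (e) new family scenario

| rank | unassigned | superkingdom | phylum | class | order | family | genus | species | avg/sum (all but unassigned) | avg/sum (all with unassigned) |
|---|---|---|---|---|---|---|---|---|---|---|
| depth | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 1.9 | 1.6 |
| true (sequences) | 15429 | 44103 | 15330 | 9698 | 7764 | 0 | 0 | 0 | 76895 | 92324 |
| false (sequences) | 0 | 644 | 1458 | 849 | 1404 | 1548 | 1507 | 266 | 7676 | 7676 |
| unknown (sequences) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| macro precision (α=0.99) | 100.0 | 99.1 | 94.1 | 79.4 | 45.8 | 0.0 | 0.0 | 0.0 | 45.5 | 52.3 |
| stdev (macro precision) | 0.0 | 0.0 | 2.4 | 24.5 | 38.6 | 0.0 | 0.0 | 0.0 | 9.4 | 8.2 |
| pred. bins | 1 | 1 | 6 | 15 | 40 | 104 | 127 | 65 | 51.1 | 44.9 |
| macro recall | 100.0 | 35.3 | 8.0 | 5.4 | 2.1 | 0.0 | 0.0 | 0.0 | 7.3 | 18.9 |
| stdev (macro recall) | 0.0 | 36.3 | 13.7 | 9.1 | 5.6 | 0.0 | 0.0 | 0.0 | 9.3 | 8.1 |
| real bins | 1 | 3 | 32 | 52 | 110 | 240 | 653 | 1690 | 397.1 | 347.6 |

| description | sum true (sequences) | sum false (sequences) | overall prec. |
|---|---|---|---|
| root+superkingdom | 103635 | 644 | 99.4 |
| phylum+class+order | 32792 | 3711 | 89.8 |
| family+genus+species | 0 | 3321 | 0.0 |
| all but unassigned | | | 90.9 |
| all with unassigned | | | 92.3 |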

[Plots for Supplementary Figure S8, panels (e)-(h), titled "taxator-tk on simulated 1000bp sequences": number of assigned sequences (true, false) and % macro-precision (α=0.99) and macro-recall, by taxonomic rank.]

Supplementary Figure S8 - Simulated 1000 bp sequence assignment with taxator-tk: (f) new order scenario

| rank | unassigned | superkingdom | phylum | class | order | family | genus | species | avg/sum (all but unassigned) | avg/sum (all with unassigned) |
|---|---|---|---|---|---|---|---|---|---|---|
| depth | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 1.7 | 1.3 |
| true (sequences) | 19932 | 47714 | 14366 | 5033 | 0 | 0 | 0 | 0 | 67113 | 87045 |
| false (sequences) | 0 | 725 | 4764 | 1596 | 3635 | 1133 | 883 | 219 | 12955 | 12955 |
| unknown (sequences) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| macro precision (α=0.99) | 100.0 | 98.9 | 81.4 | 40.2 | 0.0 | 0.0 | 0.0 | 0.0 | 31.5 | 40.1 |
| stdev (macro precision) | 0.0 | 0.0 | 11.9 | 33.9 | 0.0 | 0.0 | 0.0 | 0.0 | 6.5 | 5.7 |
| pred. bins | 1 | 1 | 6 | 17 | 49 | 80 | 98 | 50 | 43.0 | 37.8 |
| macro recall | 100.0 | 32.4 | 5.0 | 1.5 | 0.0 | 0.0 | 0.0 | 0.0 | 5.6 | 17.4 |
| stdev (macro recall) | 0.0 | 34.7 | 10.2 | 3.5 | 0.0 | 0.0 | 0.0 | 0.0 | 6.9 | 6.0 |
| real bins | 1 | 3 | 32 | 52 | 110 | 240 | 653 | 1690 | 397.1 | 347.6 |

| description | sum true (sequences) | sum false (sequences) | overall prec. |
|---|---|---|---|
| root+superkingdom | 115360 | 725 | 99.4 |
| phylum+class+order | 19399 | 9995 | 66.0 |
| family+genus+species | 0 | 2235 | 0.0 |
| all but unassigned | | | 83.8 |
| all with unassigned | | | 87.0 |

Supplementary Figure S8 - Simulated 1000 bp sequence assignment with taxator-tk: (g) new class scenario

| rank | unassigned | superkingdom | phylum | class | order | family | genus | species | avg/sum (all but unassigned) | avg/sum (all with unassigned) |
|---|---|---|---|---|---|---|---|---|---|---|
| depth | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 1.5 | 1.1 |
| true (sequences) | 28057 | 48587 | 8020 | 0 | 0 | 0 | 0 | 0 | 56607 | 84664 |
| false (sequences) | 0 | 824 | 6696 | 3649 | 2300 | 939 | 835 | 93 | 15336 | 15336 |
| unknown (sequences) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| macro precision (α=0.99) | 100.0 | 98.6 | 39.8 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 19.8 | 29.8 |
| stdev (macro precision) | 0.0 | 0.0 | 34.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 4.9 | 4.2 |
| pred. bins | 1 | 1 | 7 | 22 | 48 | 94 | 94 | 38 | 43.4 | 38.1 |
| macro recall | 100.0 | 28.5 | 2.2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 4.4 | 16.3 |
| stdev (macro recall) | 0.0 | 31.4 | 5.1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 5.2 | 4.6 |
| real bins | 1 | 3 | 32 | 52 | 110 | 240 | 653 | 1690 | 397.1 | 347.6 |

| description | sum true (sequences) | sum false (sequences) | overall prec. |
|---|---|---|---|
| root+superkingdom | 125231 | 824 | 99.3 |
| phylum+class+order | 8020 | 12645 | 38.8 |
| family+genus+species | 0 | 1867 | 0.0 |
| all but unassigned | | | 78.7 |
| all with unassigned | | | 84.7 |

Supplementary Figure S8 - Simulated 1000 bp sequence assignment with taxator-tk: (h) new phylum scenario

| rank | unassigned | superkingdom | phylum | class | order | family | genus | species | avg/sum (all but unassigned) | avg/sum (all with unassigned) |
|---|---|---|---|---|---|---|---|---|---|---|
| depth | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 1.6 | 1.0 |
| true (sequences) | 39316 | 41334 | 0 | 0 | 0 | 0 | 0 | 0 | 41334 | 80650 |
| false (sequences) | 0 | 1178 | 8658 | 2688 | 5185 | 820 | 682 | 139 | 19350 | 19350 |
| unknown (sequences) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| macro precision (α=0.99) | 100.0 | 97.7 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 14.0 | 24.7 |
| stdev (macro precision) | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| pred. bins | 1 | 1 | 12 | 21 | 44 | 98 | 99 | 41 | 45.1 | 39.6 |
| macro recall | 100.0 | 21.9 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.1 | 15.2 |
| stdev (macro recall) | 0.0 | 27.3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.9 | 3.4 |
| real bins | 1 | 3 | 32 | 52 | 110 | 240 | 653 | 1690 | 397.1 | 347.6 |

| description | sum true (sequences) | sum false (sequences) | overall prec. |
|---|---|---|---|
| root+superkingdom | 121984 | 1178 | 99.0 |
| phylum+class+order | 0 | 16531 | 0.0 |
| family+genus+species | 0 | 1641 | 0.0 |
| all but unassigned | | | 68.1 |
| all with unassigned | | | 80.7 |

Supplementary Figure S9: Taxonomic composition of microbial RefSeq54


Taxonomic composition, down to the family level, of the microbial (bacteria, archaea and viruses) portion of the RefSeq54 sequence data collection, visualized using Krona (Ondov et al., 2011). An interactive version can be found in the supplementary files (RefSeq54.krona.html). Abundance is measured in terms of accumulated sequence lengths per clade.


Supplementary Figure S10: Taxonomic composition of simArt49e


Taxonomic composition of the simulated metagenome sample simArt49e, visualized using Krona (Ondov et al., 2011). An interactive version can be found in the supplementary files (simArt49e.krona.html). Abundance is measured in terms of accumulated contig lengths. The reads for this dataset were simulated using equal coverage for every strain, so differences in the data proportions result from variable genome sizes and assembly bias.


Supplementary Figure S11 - CARMA binning of simulated metagenome with 49 species (simArt49e): (a) summary scenario

| rank | unassigned | superkingdom | phylum | class | order | family | genus | species | avg/sum (all but unassigned) | avg/sum (all with unassigned) |
|---|---|---|---|---|---|---|---|---|---|---|
| depth | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 4.2 | 2.2 |
| true (bp) | 97460.6 | 8776.0 | 5011.0 | 4568.4 | 8358.4 | 10303.4 | 17193.4 | 31789.6 | 86000.3 | 183460.9 |
| false (bp) | 0.0 | 500.1 | 7085.1 | 9153.1 | 12086.7 | 12858.1 | 13745.6 | 28288.3 | 83717.1 | 83717.1 |
| unknown (bp) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| macro precision (α=0.99) | 100.0 | 93.6 | 69.7 | 47.0 | 31.8 | 16.6 | 6.7 | 2.9 | 38.3 | 46.0 |
| stdev (macro precision) | 0.0 | 3.0 | 22.7 | 38.5 | 39.5 | 33.6 | 24.1 | 16.5 | 25.4 | 22.2 |
| pred. bins | 1 | 2 | 20 | 36 | 78 | 176 | 553 | 1672 | 362.4 | 317.3 |
| macro recall | 100.0 | 64.1 | 36.9 | 33.4 | 29.2 | 24.7 | 19.3 | 11.8 | 31.3 | 39.9 |
| stdev (macro recall) | 0.0 | 17.1 | 15.2 | 11.8 | 9.5 | 7.1 | 5.1 | 2.6 | 9.8 | 8.5 |
| real bins | 1 | 2 | 20 | 23 | 32 | 36 | 41 | 49 | 29.0 | 25.5 |

| description | sum true (bp) | sum false (bp) | overall prec. |
|---|---|---|---|
| root+superkingdom | 115012.6 | 500.1 | 99.6 |
| phylum+class+order | 17937.9 | 28325.0 | 38.8 |
| family+genus+species | 59286.4 | 54892.0 | 51.9 |
| all but unassigned | | | 50.7 |
| all with unassigned | | | 68.7 |
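The two avg/sum entries reported for each row appear to be taken over the seven assigned ranks (superkingdom to species) and over all eight ranks including unassigned, respectively: counts are summed, while depth, precision, recall and bin numbers are averaged. The Python sketch below checks this reading on the macro precision row of the CARMA summary scenario. The helper name `rank_average` is hypothetical, and the averaging rule is inferred from the reported numbers, not quoted from the text.

```python
from statistics import mean

RANKS = ["unassigned", "superkingdom", "phylum", "class",
         "order", "family", "genus", "species"]

# Macro precision (alpha = 0.99) per rank, CARMA summary scenario
# of Supplementary Figure S11, copied from the reported row.
macro_precision = [100.0, 93.6, 69.7, 47.0, 31.8, 16.6, 6.7, 2.9]


def rank_average(values, include_unassigned):
    """Average a per-rank row, optionally dropping the 'unassigned' column."""
    per_rank = dict(zip(RANKS, values))
    if not include_unassigned:
        per_rank.pop("unassigned")
    return round(mean(per_rank.values()), 1)


avg_without = rank_average(macro_precision, include_unassigned=False)
avg_with = rank_average(macro_precision, include_unassigned=True)
print(avg_without, avg_with)
```

With this rule the two averages come out as 38.3 and 46.0, matching the reported avg/sum entries of that row.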

[Plots for Supplementary Figure S11, panels (a)-(d), titled "CARMA3 binning on simulated metagenome with 49 species": assigned sequences in bp (true, false) and % macro-precision (α=0.99) and macro-recall, by taxonomic rank.]

Supplementary Figure S11 - CARMA binning of simulated metagenome with 49 species (simArt49e): (b) all reference scenario

| rank | unassigned | superkingdom | phylum | class | order | family | genus | species | avg/sum (all but unassigned) | avg/sum (all with unassigned) |
|---|---|---|---|---|---|---|---|---|---|---|
| depth | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 5.9 | 5.8 |
| true (bp) | 1071 | 166 | 130 | 108 | 614 | 1000 | 39774 | 222527 | 264319 | 265390 |
| false (bp) | 0 | 0 | 1 | 0 | 0 | 0 | 30 | 1757 | 1788 | 1788 |
| unknown (bp) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| macro precision (α=0.99) | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 99.5 | 99.9 | 99.9 |
| stdev (macro precision) | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.7 | 0.4 | 0.3 |
| pred. bins | 1 | 2 | 19 | 23 | 31 | 35 | 40 | 46 | 28.0 | 24.6 |
| macro recall | 100.0 | 99.9 | 99.9 | 99.9 | 99.9 | 99.9 | 99.8 | 82.6 | 97.4 | 97.7 |
| stdev (macro recall) | 0.0 | 0.1 | 0.2 | 0.2 | 0.2 | 0.2 | 0.2 | 18.3 | 2.8 | 2.4 |
| real bins | 1 | 2 | 20 | 23 | 32 | 36 | 41 | 49 | 29.0 | 25.5 |

| description | sum true (bp) | sum false (bp) | overall prec. |
|---|---|---|---|
| root+superkingdom | 1403 | 0 | 100.0 |
| phylum+class+order | 852 | 1 | 99.9 |
| family+genus+species | 263301 | 1787 | 99.3 |
| all but unassigned | | | 99.3 |
| all with unassigned | | | 99.3 |

Supplementary Figure S11 - CARMA binning of simulated metagenome with 49 species (simArt49e): (c) new species scenario

| rank | unassigned | superkingdom | phylum | class | order | family | genus | species | avg/sum (all but unassigned) | avg/sum (all with unassigned) |
|---|---|---|---|---|---|---|---|---|---|---|
| depth | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 5.1 | 4.0 |
| true (bp) | 48039 | 3734 | 5195 | 6657 | 17364 | 43855 | 80580 | 0 | 157385 | 205424 |
| false (bp) | 0 | 174 | 1718 | 2267 | 3022 | 4055 | 10693 | 39825 | 61754 | 61754 |
| unknown (bp) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| macro precision (α=0.99) | 100.0 | 98.9 | 93.8 | 81.4 | 56.4 | 32.0 | 12.5 | 0.0 | 53.6 | 59.4 |
| stdev (macro precision) | 0.0 | 0.2 | 5.8 | 31.7 | 46.1 | 44.7 | 32.0 | 0.0 | 22.9 | 20.0 |
| pred. bins | 1 | 2 | 17 | 24 | 48 | 96 | 216 | 1153 | 222.3 | 194.6 |
| macro recall | 100.0 | 82.3 | 65.6 | 63.8 | 59.4 | 53.0 | 35.0 | 0.0 | 51.3 | 57.4 |
| stdev (macro recall) | 0.0 | 8.3 | 31.8 | 28.3 | 32.1 | 33.4 | 35.5 | 0.0 | 24.2 | 21.2 |
| real bins | 1 | 2 | 20 | 23 | 32 | 36 | 41 | 49 | 29.0 | 25.5 |

| description | sum true (bp) | sum false (bp) | overall prec. |
|---|---|---|---|
| root+superkingdom | 55507 | 174 | 99.7 |
| phylum+class+order | 29216 | 7007 | 80.7 |
| family+genus+species | 124435 | 54573 | 69.5 |
| all but unassigned | | | 71.8 |
| all with unassigned | | | 76.9 |

Supplementary Figure S11 - CARMA binning of simulated metagenome with 49 species (simArt49e): (d) new genus scenario

| rank | unassigned | superkingdom | phylum | class | order | family | genus | species | avg/sum (all but unassigned) | avg/sum (all with unassigned) |
|---|---|---|---|---|---|---|---|---|---|---|
| depth | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 4.3 | 2.5 |
| true (bp) | 101939 | 7388 | 7629 | 8751 | 20074 | 27269 | 0 | 0 | 71111 | 173050 |
| false (bp) | 0 | 386 | 4042 | 5633 | 8920 | 13450 | 33904 | 27793 | 94128 | 94128 |
| unknown (bp) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| macro precision (α=0.99) | 100.0 | 96.1 | 78.4 | 53.1 | 33.8 | 12.5 | 0.0 | 0.0 | 39.1 | 46.7 |
| stdev (macro precision) | 0.0 | 1.9 | 17.6 | 39.5 | 39.9 | 29.1 | 0.0 | 0.0 | 18.3 | 16.0 |
| pred. bins | 1 | 2 | 17 | 31 | 65 | 156 | 535 | 1788 | 370.6 | 324.4 |
| macro recall | 100.0 | 67.2 | 40.7 | 38.3 | 32.7 | 20.1 | 0.0 | 0.0 | 28.4 | 37.4 |
| stdev (macro recall) | 0.0 | 14.9 | 29.4 | 26.6 | 27.5 | 23.1 | 0.0 | 0.0 | 17.4 | 15.2 |
| real bins | 1 | 2 | 20 | 23 | 32 | 36 | 41 | 49 | 29.0 | 25.5 |

| description | sum true (bp) | sum false (bp) | overall prec. |
|---|---|---|---|
| root+superkingdom | 116715 | 386 | 99.7 |
| phylum+class+order | 36454 | 18595 | 66.2 |
| family+genus+species | 27269 | 75147 | 26.6 |
| all but unassigned | | | 43.0 |
| all with unassigned | | | 64.8 |

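Supplementary Figure S11 - CARMA binning of simulated metagenome with 49 species (simArt49e): (e) new family scenario

| rank | unassigned | superkingdom | phylum | class | order | family | genus | species | avg/sum (all but unassigned) | avg/sum (all with unassigned) |
|---|---|---|---|---|---|---|---|---|---|---|
| depth | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 3.8 | 2.0 |
| true (bp) | 114225 | 11362 | 9527 | 9232 | 20457 | 0 | 0 | 0 | 50578 | 164803 |
| false (bp) | 0 | 536 | 6860 | 8904 | 13293 | 24317 | 18709 | 29756 | 102375 | 102375 |
| unknown (bp) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| macro precision (α=0.99) | 100.0 | 92.1 | 48.3 | 23.8 | 9.7 | 0.0 | 0.0 | 0.0 | 24.8 | 34.2 |
| stdev (macro precision) | 0.0 | 3.8 | 32.1 | 32.6 | 22.8 | 0.0 | 0.0 | 0.0 | 13.0 | 11.4 |
| pred. bins | 1 | 2 | 18 | 36 | 81 | 196 | 625 | 1816 | 396.3 | 346.9 |
| macro recall | 100.0 | 57.6 | 26.3 | 21.8 | 12.7 | 0.0 | 0.0 | 0.0 | 16.9 | 27.3 |
| stdev (macro recall) | 0.0 | 21.4 | 27.3 | 25.3 | 19.9 | 0.0 | 0.0 | 0.0 | 13.4 | 11.7 |
| real bins | 1 | 2 | 20 | 23 | 32 | 36 | 41 | 49 | 29.0 | 25.5 |

| description | sum true (bp) | sum false (bp) | overall prec. |
|---|---|---|---|
| root+superkingdom | 136949 | 536 | 99.6 |
| phylum+class+order | 39216 | 29057 | 57.4 |
| family+genus+species | 0 | 72782 | 0.0 |
| all but unassigned | | | 33.1 |
| all with unassigned | | | 61.7 |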

[Plots for Supplementary Figure S11, panels (e)-(h), titled "CARMA3 binning on simulated metagenome with 49 species": assigned sequences in bp (true, false) and % macro-precision (α=0.99) and macro-recall, by taxonomic rank.]

Supplementary Figure S11 - CARMA binning of simulated metagenome with 49 species (simArt49e): (f) new order scenario

| rank | unassigned | superkingdom | phylum | class | order | family | genus | species | avg/sum (all but unassigned) | avg/sum (all with unassigned) |
|---|---|---|---|---|---|---|---|---|---|---|
| depth | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 3.5 | 1.6 |
| true (bp) | 130123 | 12342 | 8019 | 7231 | 0 | 0 | 0 | 0 | 27592 | 157715 |
| false (bp) | 0 | 706 | 10411 | 14435 | 20646 | 18233 | 12779 | 32253 | 109463 | 109463 |
| unknown (bp) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| macro precision (α=0.99) | 100.0 | 87.7 | 27.9 | 10.0 | 0.0 | 0.0 | 0.0 | 0.0 | 17.9 | 28.2 |
| stdev (macro precision) | 0.0 | 7.2 | 30.5 | 21.5 | 0.0 | 0.0 | 0.0 | 0.0 | 8.5 | 7.4 |
| pred. bins | 1 | 2 | 21 | 39 | 90 | 203 | 652 | 1810 | 402.4 | 352.3 |
| macro recall | 100.0 | 52.3 | 17.0 | 10.1 | 0.0 | 0.0 | 0.0 | 0.0 | 11.4 | 22.4 |
| stdev (macro recall) | 0.0 | 23.4 | 23.2 | 16.8 | 0.0 | 0.0 | 0.0 | 0.0 | 9.0 | 7.9 |
| real bins | 1 | 2 | 20 | 23 | 32 | 36 | 41 | 49 | 29.0 | 25.5 |

| description | sum true (bp) | sum false (bp) | overall prec. |
|---|---|---|---|
| root+superkingdom | 154807 | 706 | 99.5 |
| phylum+class+order | 15250 | 45492 | 25.1 |
| family+genus+species | 0 | 63265 | 0.0 |
| all but unassigned | | | 20.1 |
| all with unassigned | | | 59.0 |

Supplementary Figure S11 - CARMA binning of simulated metagenome with 49 species (simArt49e): (g) new class scenario

| rank | unassigned | superkingdom | phylum | class | order | family | genus | species | avg/sum (all but unassigned) | avg/sum (all with unassigned) |
|---|---|---|---|---|---|---|---|---|---|---|
| depth | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 3.4 | 1.4 |
| true (bp) | 139175 | 13051 | 4577 | 0 | 0 | 0 | 0 | 0 | 17628 | 156803 |
| false (bp) | 0 | 778 | 12085 | 18605 | 19608 | 15771 | 10539 | 32989 | 110375 | 110375 |
| unknown (bp) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| macro precision (α=0.99) | 100.0 | 84.3 | 12.9 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 13.9 | 24.6 |
| stdev (macro precision) | 0.0 | 9.8 | 20.1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 4.3 | 3.7 |
| pred. bins | 1 | 2 | 22 | 41 | 91 | 206 | 657 | 1814 | 404.7 | 354.3 |
| macro recall | 100.0 | 48.2 | 8.9 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 8.2 | 19.6 |
| stdev (macro recall) | 0.0 | 24.2 | 15.4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 5.7 | 5.0 |
| real bins | 1 | 2 | 20 | 23 | 32 | 36 | 41 | 49 | 29.0 | 25.5 |

| description | sum true (bp) | sum false (bp) | overall prec. |
|---|---|---|---|
| root+superkingdom | 165277 | 778 | 99.5 |
| phylum+class+order | 4577 | 50298 | 8.3 |
| family+genus+species | 0 | 59299 | 0.0 |
| all but unassigned | | | 13.8 |
| all with unassigned | | | 58.7 |

Supplementary Figure S11 - CARMA binning of simulated metagenome with 49 species (simArt49e): (h) new phylum scenario

| rank | unassigned | superkingdom | phylum | class | order | family | genus | species | avg/sum (all but unassigned) | avg/sum (all with unassigned) |
|---|---|---|---|---|---|---|---|---|---|---|
| depth | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 3.4 | 1.2 |
| true (bp) | 147652 | 13389 | 0 | 0 | 0 | 0 | 0 | 0 | 13389 | 161041 |
| false (bp) | 0 | 921 | 14479 | 14228 | 19118 | 14181 | 9565 | 33645 | 106137 | 106137 |
| unknown (bp) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| macro precision (α=0.99) | 100.0 | 75.1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 10.7 | 21.9 |
| stdev (macro precision) | 0.0 | 17.3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.5 | 2.2 |
| pred. bins | 1 | 2 | 24 | 42 | 93 | 214 | 664 | 1820 | 408.4 | 357.5 |
| macro recall | 100.0 | 41.1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 5.9 | 17.6 |
| stdev (macro recall) | 0.0 | 27.2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.9 | 3.4 |
| real bins | 1 | 2 | 20 | 23 | 32 | 36 | 41 | 49 | 29.0 | 25.5 |

| description | sum true (bp) | sum false (bp) | overall prec. |
|---|---|---|---|
| root+superkingdom | 174430 | 921 | 99.5 |
| phylum+class+order | 0 | 47825 | 0.0 |
| family+genus+species | 0 | 57391 | 0.0 |
| all but unassigned | | | 11.2 |
| all with unassigned | | | 60.3 |

Supplementary Figure S12 - MEGAN binning of simulated metagenome with 49 species (simArt49e): (a) summary scenario

| rank | unassigned | superkingdom | phylum | class | order | family | genus | species | avg/sum (all but unassigned) | avg/sum (all with unassigned) |
|---|---|---|---|---|---|---|---|---|---|---|
| depth | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 2.4 | 1.7 |
| true (bp) | 62255.4 | 85269.4 | 5415.0 | 3302.9 | 6038.6 | 6638.4 | 18552.9 | 27832.6 | 153049.7 | 215305.1 |
| false (bp) | 0.0 | 8388.1 | 3937.6 | 1523.4 | 1545.7 | 2415.9 | 6525.4 | 27536.7 | 51872.9 | 51872.9 |
| unknown (bp) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| macro precision (α=0.99) | 100.0 | 97.8 | 89.4 | 61.7 | 43.3 | 22.4 | 9.3 | 5.4 | 47.0 | 53.7 |
| stdev (macro precision) | 0.0 | 0.1 | 8.3 | 41.3 | 45.1 | 38.7 | 27.9 | 21.9 | 26.2 | 22.9 |
| pred. bins | 1 | 2 | 19 | 33 | 66 | 139 | 400 | 824 | 211.9 | 185.5 |
| macro recall | 100.0 | 64.8 | 35.9 | 34.3 | 32.3 | 28.0 | 21.2 | 11.8 | 32.6 | 41.0 |
| stdev (macro recall) | 0.0 | 21.3 | 14.3 | 9.9 | 9.3 | 8.0 | 5.9 | 4.5 | 10.4 | 9.1 |
| real bins | 1 | 2 | 20 | 23 | 32 | 36 | 41 | 49 | 29.0 | 25.5 |

| description | sum true (bp) | sum false (bp) | overall prec. |
|---|---|---|---|
| root+superkingdom | 232794.3 | 8388.1 | 96.5 |
| phylum+class+order | 14756.4 | 7006.7 | 67.8 |
| family+genus+species | 53023.9 | 36478.0 | 59.2 |
| all but unassigned | | | 74.7 |
| all with unassigned | | | 80.6 |

[Plots for Supplementary Figure S12, panels (a)-(d), titled "MEGAN binning of simulated metagenome with 49 species": assigned sequences in bp (true, false) and % macro-precision (α=0.99) and macro-recall, by taxonomic rank.]

Supplementary Figure S12 - MEGAN binning of simulated metagenome with 49 species (simArt49e): (b) all reference scenario

| rank | unassigned | superkingdom | phylum | class | order | family | genus | species | avg/sum (all but unassigned) | avg/sum (all with unassigned) |
|---|---|---|---|---|---|---|---|---|---|---|
| depth | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 5.8 | 5.7 |
| true (bp) | 595 | 853 | 515 | 399 | 2034 | 5388 | 62555 | 194828 | 266572 | 267167 |
| false (bp) | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 10 | 11 | 11 |
| unknown (bp) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| macro precision (α=0.99) | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 |
| stdev (macro precision) | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| pred. bins | 1 | 2 | 19 | 23 | 31 | 35 | 40 | 44 | 27.7 | 24.4 |
| macro recall | 100.0 | 99.9 | 99.9 | 99.9 | 99.9 | 99.8 | 99.3 | 82.5 | 97.3 | 97.7 |
| stdev (macro recall) | 0.0 | 0.1 | 0.1 | 0.1 | 0.1 | 0.3 | 1.6 | 31.3 | 4.8 | 4.2 |
| real bins | 1 | 2 | 20 | 23 | 32 | 36 | 41 | 49 | 29.0 | 25.5 |

| description | sum true (bp) | sum false (bp) | overall prec. |
|---|---|---|---|
| root+superkingdom | 2301 | 0 | 100.0 |
| phylum+class+order | 2948 | 1 | 100.0 |
| family+genus+species | 262771 | 10 | 100.0 |
| all but unassigned | | | 100.0 |
| all with unassigned | | | 100.0 |

Supplementary Figure S12 - MEGAN binning of simulated metagenome with 49 species (simArt49e): (c) new species scenario

| rank | unassigned | superkingdom | phylum | class | order | family | genus | species | avg/sum (all but unassigned) | avg/sum (all with unassigned) |
|---|---|---|---|---|---|---|---|---|---|---|
| depth | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 4.3 | 3.7 |
| true (bp) | 22828 | 34238 | 3671 | 3130 | 9167 | 20053 | 67315 | 0 | 137574 | 160402 |
| false (bp) | 0 | 3128 | 867 | 282 | 388 | 979 | 3069 | 98063 | 106776 | 106776 |
| unknown (bp) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| macro precision (α=0.99) | 100.0 | 99.6 | 98.4 | 98.4 | 83.5 | 58.9 | 22.9 | 0.0 | 66.0 | 70.2 |
| stdev (macro precision) | 0.0 | 0.1 | 1.9 | 1.8 | 34.4 | 47.3 | 41.1 | 0.0 | 18.1 | 15.8 |
| pred. bins | 1 | 2 | 17 | 21 | 35 | 55 | 121 | 218 | 67.0 | 58.8 |
| macro recall | 100.0 | 87.1 | 67.3 | 70.8 | 70.5 | 65.3 | 49.0 | 0.0 | 58.6 | 63.8 |
| stdev (macro recall) | 0.0 | 9.1 | 33.8 | 29.0 | 31.9 | 34.3 | 40.8 | 0.0 | 25.6 | 22.4 |
| real bins | 1 | 2 | 20 | 23 | 32 | 36 | 41 | 49 | 29.0 | 25.5 |

| description | sum true (bp) | sum false (bp) | overall prec. |
|---|---|---|---|
| root+superkingdom | 91304 | 3128 | 96.7 |
| phylum+class+order | 15968 | 1537 | 91.2 |
| family+genus+species | 87368 | 102111 | 46.1 |
| all but unassigned | | | 56.3 |
| all with unassigned | | | 60.0 |

Supplementary Figure S12 - MEGAN binning of simulated metagenome with 49 species (simArt49e): (d) new genus scenario

| rank | unassigned | superkingdom | phylum | class | order | family | genus | species | avg/sum (all but unassigned) | avg/sum (all with unassigned) |
|---|---|---|---|---|---|---|---|---|---|---|
| depth | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 2.5 | 1.9 |
| true (bp) | 58109 | 89947 | 8288 | 6666 | 14535 | 21028 | 0 | 0 | 140464 | 198573 |
| false (bp) | 0 | 6636 | 1861 | 657 | 997 | 3013 | 20343 | 35098 | 68605 | 68605 |
| unknown (bp) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| macro precision (α=0.99) | 100.0 | 98.8 | 93.1 | 76.6 | 52.0 | 22.3 | 0.0 | 0.0 | 49.0 | 55.3 |
| stdev (macro precision) | 0.0 | 0.0 | 9.2 | 34.7 | 45.1 | 38.5 | 0.0 | 0.0 | 18.2 | 15.9 |
| pred. bins | 1 | 2 | 15 | 25 | 50 | 105 | 274 | 430 | 128.7 | 112.8 |
| macro recall | 100.0 | 74.5 | 40.0 | 41.8 | 41.1 | 30.6 | 0.0 | 0.0 | 32.6 | 41.0 |
| stdev (macro recall) | 0.0 | 15.4 | 29.9 | 26.2 | 29.8 | 30.6 | 0.0 | 0.0 | 18.8 | 16.5 |
| real bins | 1 | 2 | 20 | 23 | 32 | 36 | 41 | 49 | 29.0 | 25.5 |

| description | sum true (bp) | sum false (bp) | overall prec. |
|---|---|---|---|
| root+superkingdom | 238003 | 6636 | 97.3 |
| phylum+class+order | 29489 | 3515 | 89.3 |
| family+genus+species | 21028 | 58454 | 26.5 |
| all but unassigned | | | 67.2 |
| all with unassigned | | | 74.3 |

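Supplementary Figure S12 - MEGAN binning of simulated metagenome with 49 species (simArt49e): (e) new family scenario

| rank | unassigned | superkingdom | phylum | class | order | family | genus | species | avg/sum (all but unassigned) | avg/sum (all with unassigned) |
|---|---|---|---|---|---|---|---|---|---|---|
| depth | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 1.9 | 1.3 |
| true (bp) | 73809 | 111523 | 10028 | 7514 | 16534 | 0 | 0 | 0 | 145599 | 219408 |
| false (bp) | 0 | 8005 | 2966 | 1043 | 1481 | 5906 | 8487 | 19882 | 47770 | 47770 |
| unknown (bp) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| macro precision (α=0.99) | 100.0 | 97.2 | 78.2 | 37.5 | 17.5 | 0.0 | 0.0 | 0.0 | 32.9 | 41.3 |
| stdev (macro precision) | 0.0 | 0.4 | 27.3 | 42.3 | 32.9 | 0.0 | 0.0 | 0.0 | 14.7 | 12.9 |
| pred. bins | 1 | 2 | 14 | 31 | 69 | 161 | 366 | 580 | 174.7 | 153.0 |
| macro recall | 100.0 | 56.5 | 23.5 | 20.4 | 14.9 | 0.0 | 0.0 | 0.0 | 16.5 | 26.9 |
| stdev (macro recall) | 0.0 | 28.9 | 25.8 | 21.9 | 21.1 | 0.0 | 0.0 | 0.0 | 14.0 | 12.2 |
| real bins | 1 | 2 | 20 | 23 | 32 | 36 | 41 | 49 | 29.0 | 25.5 |

| description | sum true (bp) | sum false (bp) | overall prec. |
|---|---|---|---|
| root+superkingdom | 296855 | 8005 | 97.4 |
| phylum+class+order | 34076 | 5490 | 86.1 |
| family+genus+species | 0 | 34275 | 0.0 |
| all but unassigned | | | 75.3 |
| all with unassigned | | | 82.1 |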

[Plots for Supplementary Figure S12, panels (e)-(h), titled "MEGAN binning of simulated metagenome with 49 species": assigned sequences in bp (true, false) and % macro-precision (α=0.99) and macro-recall, by taxonomic rank.]

Supplementary Figure S12 - MEGAN binning of simulated metagenome with 49 species (simArt49e): (f) new order scenario

| rank | unassigned | superkingdom | phylum | class | order | family | genus | species | avg/sum (all but unassigned) | avg/sum (all with unassigned) |
|---|---|---|---|---|---|---|---|---|---|---|
| depth | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 1.5 | 1.0 |
| true (bp) | 89682 | 118597 | 10011 | 5411 | 0 | 0 | 0 | 0 | 134019 | 223701 |
| false (bp) | 0 | 11005 | 5581 | 1881 | 2446 | 3416 | 5382 | 13766 | 43477 | 43477 |
| unknown (bp) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| macro precision (α=0.99) | 100.0 | 95.5 | 47.4 | 14.4 | 0.0 | 0.0 | 0.0 | 0.0 | 22.5 | 32.2 |
| stdev (macro precision) | 0.0 | 1.3 | 37.8 | 28.3 | 0.0 | 0.0 | 0.0 | 0.0 | 9.6 | 8.4 |
| pred. bins | 1 | 2 | 17 | 39 | 84 | 167 | 401 | 565 | 182.1 | 159.5 |
| macro recall | 100.0 | 49.8 | 13.8 | 7.0 | 0.0 | 0.0 | 0.0 | 0.0 | 10.1 | 21.3 |
| stdev (macro recall) | 0.0 | 30.9 | 21.7 | 11.7 | 0.0 | 0.0 | 0.0 | 0.0 | 9.2 | 8.0 |
| real bins | 1 | 2 | 20 | 23 | 32 | 36 | 41 | 49 | 29.0 | 25.5 |

| description | sum true (bp) | sum false (bp) | overall prec. |
|---|---|---|---|
| root+superkingdom | 326876 | 11005 | 96.7 |
| phylum+class+order | 15422 | 9908 | 60.9 |
| family+genus+species | 0 | 22564 | 0.0 |
| all but unassigned | | | 75.5 |
| all with unassigned | | | 83.7 |

Supplementary Figure S12 - MEGAN binning of simulated metagenome with 49 species (simArt49e): (g) new class scenario

| rank | unassigned | superkingdom | phylum | class | order | family | genus | species | avg/sum (all but unassigned) | avg/sum (all with unassigned) |
|---|---|---|---|---|---|---|---|---|---|---|
| depth | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 1.4 | 0.9 |
| true (bp) | 94817 | 122665 | 5392 | 0 | 0 | 0 | 0 | 0 | 128057 | 222874 |
| false (bp) | 0 | 11467 | 7208 | 4356 | 2145 | 2203 | 4437 | 12488 | 44304 | 44304 |
| unknown (bp) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| macro precision (α=0.99) | 100.0 | 93.2 | 25.1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 16.9 | 27.3 |
| stdev (macro precision) | 0.0 | 2.8 | 30.7 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 4.8 | 4.2 |
| pred. bins | 1 | 2 | 19 | 43 | 88 | 172 | 446 | 657 | 203.9 | 178.5 |
| macro recall | 100.0 | 45.9 | 6.4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 7.5 | 19.0 |
| stdev (macro recall) | 0.0 | 31.8 | 11.6 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 6.2 | 5.4 |
| real bins | 1 | 2 | 20 | 23 | 32 | 36 | 41 | 49 | 29.0 | 25.5 |

| description | sum true (bp) | sum false (bp) | overall prec. |
|---|---|---|---|
| root+superkingdom | 340147 | 11467 | 96.7 |
| phylum+class+order | 5392 | 13709 | 28.2 |
| family+genus+species | 0 | 19128 | 0.0 |
| all but unassigned | | | 74.3 |
| all with unassigned | | | 83.4 |

Supplementary Figure S12 - MEGAN binning of simulated metagenome with 49 species (simArt49e): (h) new phylum scenario

| rank | unassigned | superkingdom | phylum | class | order | family | genus | species | avg/sum (all but unassigned) | avg/sum (all with unassigned) |
|---|---|---|---|---|---|---|---|---|---|---|
| depth | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 1.3 | 0.8 |
| true (bp) | 95948 | 119063 | 0 | 0 | 0 | 0 | 0 | 0 | 119063 | 215011 |
| false (bp) | 0 | 18476 | 9079 | 2445 | 3363 | 1394 | 3960 | 13450 | 52167 | 52167 |
| unknown (bp) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| macro precision (α=0.99) | 100.0 | 84.6 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 12.1 | 23.1 |
| stdev (macro precision) | 0.0 | 8.8 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.3 | 1.1 |
| pred. bins | 1 | 2 | 25 | 45 | 96 | 197 | 494 | 814 | 239.0 | 209.3 |
| macro recall | 100.0 | 39.5 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 5.6 | 17.4 |
| stdev (macro recall) | 0.0 | 33.1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 4.7 | 4.1 |
| real bins | 1 | 2 | 20 | 23 | 32 | 36 | 41 | 49 | 29.0 | 25.5 |

| description | sum true (bp) | sum false (bp) | overall prec. |
|---|---|---|---|
| root+superkingdom | 334074 | 18476 | 94.8 |
| phylum+class+order | 0 | 14887 | 0.0 |
| family+genus+species | 0 | 18804 | 0.0 |
| all but unassigned | | | 69.5 |
| all with unassigned | | | 80.5 |

Supplementary Figure S13 - Taxator-tk binning of simulated metagenome with 49 species (simArt49e): (a) summary scenario

| rank | unassigned | superkingdom | phylum | class | order | family | genus | species | avg/sum (all but unassigned) | avg/sum (all with unassigned) |
|---|---|---|---|---|---|---|---|---|---|---|
| depth | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 2.1 | 1.5 |
| true (bp) | 75644.4 | 106494.7 | 7691.6 | 3656.6 | 5832.6 | 7550.6 | 20271.9 | 11952.3 | 163450.1 | 239094.6 |
| false (bp) | 0.0 | 10293.3 | 9493.1 | 2344.0 | 2014.3 | 1079.7 | 1397.1 | 1461.9 | 28083.4 | 28083.4 |
| unknown (bp) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| macro precision (α=0.99) | 100.0 | 96.9 | 94.9 | 91.2 | 85.9 | 76.4 | 65.9 | 61.1 | 81.8 | 84.0 |
| stdev (macro precision) | 0.0 | 2.5 | 9.2 | 21.5 | 31.8 | 39.8 | 46.4 | 47.2 | 28.3 | 24.8 |
| pred. bins | 1 | 2 | 16 | 21 | 34 | 44 | 58 | 65 | 34.3 | 30.1 |
| macro recall | 100.0 | 56.8 | 18.2 | 18.3 | 16.2 | 13.8 | 9.4 | 2.5 | 19.3 | 29.4 |
| stdev (macro recall) | 0.0 | 33.5 | 13.5 | 11.6 | 9.5 | 8.2 | 7.7 | 4.4 | 12.6 | 11.0 |
| real bins | 1 | 2 | 20 | 23 | 32 | 36 | 41 | 49 | 29.0 | 25.5 |

| description | sum true (bp) | sum false (bp) | overall prec. |
|---|---|---|---|
| root+superkingdom | 288633.9 | 10293.3 | 96.6 |
| phylum+class+order | 17180.7 | 13851.4 | 55.4 |
| family+genus+species | 39774.7 | 3938.7 | 91.0 |
| all but unassigned | | | 85.3 |
| all with unassigned | | | 89.5 |

Supplementary Figure S13 - Taxator-tk binning of simulated metagenome with 49 species (simArt49e) rank unassigned superkingdom phylum class order family genus species avg/sum avg/sum

depth

true (bp)

false (bp)

0 1 2 3 4 5 6 7 4.4 3.6

34453 35661 2897 1947 8254 19632 80667 83666 232724 267177

0 0 0 0 0 0 0 1 1 1

50 40 30 20 10 0

unassigned superkingdom

phylum

class

order

family

genus

depth

true (bp)

false (bp)

unknown (bp)

0 1 2 3 4 5 6 7 3.4 2.6

63523 75519 8526 6516 14335 21470 61236 0 187602 251125

0 4247 1834 515 322 246 1365 7524 16053 16053

0 0 0 0 0 0 0 0 0 0

stdev

pred. bins

macro recall

stdev

real bins

0.0 0.9 1.4 0.5 0.5 0.8 30.3 0.0 4.9 4.3

1 2 15 18 25 28 26 48 23.1 20.4

100.0 74.7 39.3 42.8 39.9 34.9 22.6 0.0 36.3 44.3

0.0 16.4 28.6 27.4 26.9 28.6 28.3 0.0 22.3 19.5

1 2 20 23 32 36 41 49 29.0 25.5

(c) new species scenario sum true (bp)

sum false (bp)

overall prec.

description

214561

4247

98.1

root+superkingdom

29377

2671

91.7

phylum+class+order

82706

9135

90.1

family+genus+species

92.1 94.0

all but unassigned all with unassigned

70 60 50 40 30 20 10 order

0.0 8.4 28.4 29.1 30.6 31.5 34.0 30.5 27.5 24.1

1 2 20 23 32 36 41 49 29.0 25.5

description

105775

0

100.0

root+superkingdom

13098

0

100.0

phylum+class+order

183965

1

100.0

family+genus+species

100.0 100.0

all but unassigned all with unassigned

[Plot: taxonomic rank (unassigned, superkingdom, phylum, class, order, family, genus, species) on the x-axis; legend: false (bp), true (bp), macro precision α=0.99, macro recall.]

Supplementary Figure S13 - Taxator-tk binning of simulated metagenome with 49 species (simArt49e) rank unassigned superkingdom phylum class order family genus species avg/sum avg/sum

family

genus

species

depth

true (bp)

false (bp)

unknown (bp)

0 1 2 3 4 5 6 7 1.8 1.3

82737 117640 12404 6555 12254 11752 0 0 160605 243342

0 8730 4567 1439 1241 1508 4633 1718 23836 23836

0 0 0 0 0 0 0 0 0 0

macro precision α=0.99 100.0 98.1 98.2 98.5 97.1 56.2 0.0 0.0 64.0 68.5

stdev

pred. bins

macro recall

stdev

real bins

0.0 1.4 3.3 1.9 4.9 46.1 0.0 0.0 8.2 7.2

1 2 13 16 22 33 52 49 26.7 23.5

100.0 61.5 20.1 20.9 17.6 9.5 0.0 0.0 18.5 28.7

0.0 31.7 21.3 21.8 20.2 16.7 0.0 0.0 16.0 14.0

1 2 20 23 32 36 41 49 29.0 25.5

(d) new genus scenario sum true (bp)

sum false (bp)

overall prec.

description

318017

8730

97.3

root+superkingdom

31213

7247

81.2

phylum+class+order

11752

7859

59.9

family+genus+species

87.1 91.1

all but unassigned all with unassigned

taxator-tk binning of simulated metagenome with 49 species

false (bp) true (bp) macro precision α=0.99 macro recall

% macro-precision and macro-recall

assigned sequences in bp

% macro-precision and macro-recall

80

class

100.0 68.1 46.5 52.5 51.7 52.1 43.4 17.7 47.4 54.0

overall prec.

taxonomic rank

90

phylum

1 2 17 21 29 32 34 34 24.1 21.3

sum false (bp)

70

taxator-tk binning of simulated metagenome with 49 species

unassigned superkingdom

0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

sum true (bp)

80

0

100

0

real bins

100 90 80

assigned sequences in bp

rank

stdev

90

species

Supplementary Figure S13 - Taxator-tk binning of simulated metagenome with 49 species (simArt49e)

unassigned superkingdom phylum class order family genus species avg/sum avg/sum

macro recall

100

taxonomic rank

macro precision α=0.99 100.0 99.1 99.4 99.7 99.7 99.7 88.7 0.0 83.8 85.8

pred. bins

assigned sequences in bp

60

false (bp) true (bp) macro precision α=0.99 macro recall

% macro-precision and macro-recall

assigned sequences in bp

% macro-precision and macro-recall

90

70

stdev

taxator-tk binning of simulated metagenome with 49 species

taxator-tk binning of simulated metagenome with 49 species 100

80

macro unknown precision (bp) α=0.99 0 100.0 0 100.0 0 100.0 0 100.0 0 100.0 0 100.0 0 100.0 0 100.0 0 100.0 0 100.0

(b) all reference scenario

70 60 50 40 30 20 10 0

unassigned superkingdom

taxonomic rank

phylum

class

order

taxonomic rank

family

genus

species

false (bp) true (bp) macro precision α=0.99 macro recall

Supplementary Figure S13 (simArt49e), taxator-tk binning: (e) new family scenario

rank                           depth  true (bp)  false (bp)  unknown (bp)  macro precision α=0.99  stdev  pred. bins  macro recall  stdev  real bins
unassigned                     0      86887      0           0             100.0                   0.0    1           100.0         0.0    1
superkingdom                   1      129189     10849       0             95.9                    2.0    2           50.8          42.7   2
phylum                         2      13258      6399        0             95.0                    6.2    8           12.1          19.5   20
class                          3      6610       1866        0             96.0                    4.4    12          8.9           14.0   23
order                          4      5985       1537        0             38.6                    46.6   27          4.3           8.3    32
family                         5      0          2768        0             0.0                     0.0    48          0.0           0.0    36
genus                          6      0          1474        0             0.0                     0.0    85          0.0           0.0    41
species                        7      0          356         0             0.0                     0.0    81          0.0           0.0    49
avg/sum (all but unassigned)   1.4    155042     25249       0             46.5                    8.5    37.6        10.9          12.1   29.0
avg/sum (all with unassigned)  1.0    241929     25249       0             53.2                    7.4    33.0        22.0          10.6   25.5

description           sum true (bp)  sum false (bp)  overall prec.
root+superkingdom     345265         10849           97.0
phylum+class+order    25853          9802            72.5
family+genus+species  0              4598            0.0
all but unassigned                                   86.0
all with unassigned                                  90.5

Supplementary Figure S13 - Taxator-tk binning of simulated metagenome with 49 species (simArt49e) rank unassigned superkingdom phylum class order family genus species avg/sum avg/sum

depth

true (bp)

false (bp)

0 1 2 3 4 5 6 7 1.3 0.9

83674 131010 11454 3968 0 0 0 0 146432 230106

0 14172 14889 3071 2444 1364 901 231 37072 37072

[Plot: taxonomic rank (unassigned, superkingdom, phylum, class, order, family, genus) on the x-axis; legend: false (bp), true (bp), macro precision α=0.99, macro recall.]

depth

true (bp)

false (bp)

unknown (bp)

0 1 2 3 4 5 6 7 1.3 0.8

88280 132003 5302 0 0 0 0 0 137305 225585

0 14601 17992 4839 2332 961 672 196 41593 41593

0 0 0 0 0 0 0 0 0 0

stdev

pred. bins

macro recall

stdev

real bins

0.0 0.0 32.3 0.0 0.0 0.0 0.0 0.0 4.6 4.0

1 1 5 17 41 73 107 74 45.4 39.9

100.0 47.9 2.3 0.0 0.0 0.0 0.0 0.0 7.2 18.8

0.0 44.8 5.5 0.0 0.0 0.0 0.0 0.0 7.2 6.3

1 2 20 23 32 36 41 49 29.0 25.5

(g) new class scenario sum true (bp)

sum false (bp)

overall prec.

description

352286

14601

96.0

root+superkingdom

5302

25163

17.4

phylum+class+order

0

1829

0.0

family+genus+species

76.8 84.4

all but unassigned all with unassigned

70 60 50 40 30 20 10 order

0.0 45.1 16.0 7.2 0.0 0.0 0.0 0.0 9.7 8.5

1 2 20 23 32 36 41 49 29.0 25.5

description

345694

14172

96.1

root+superkingdom

15422

20404

43.0

phylum+class+order

0

2496

0.0

family+genus+species

79.8 86.1

all but unassigned all with unassigned

[Plot: taxonomic rank (unassigned, superkingdom, phylum, class, order, family, genus, species) on the x-axis; legend: false (bp), true (bp), macro precision α=0.99, macro recall.]

Supplementary Figure S13 - Taxator-tk binning of simulated metagenome with 49 species (simArt49e) rank unassigned superkingdom phylum class order family genus species avg/sum avg/sum

family

genus

species

depth

true (bp)

false (bp)

unknown (bp)

0 1 2 3 4 5 6 7 1.3 0.9

89957 124441 0 0 0 0 0 0 124441 214398

0 19454 20771 4678 6224 711 735 207 52780 52780

0 0 0 0 0 0 0 0 0 0

macro precision α=0.99 100.0 89.7 0.0 0.0 0.0 0.0 0.0 0.0 12.8 23.7

stdev

pred. bins

macro recall

stdev

real bins

0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

1 1 6 15 27 94 141 100 54.9 48.1

100.0 45.6 0.0 0.0 0.0 0.0 0.0 0.0 6.5 18.2

0.0 45.0 0.0 0.0 0.0 0.0 0.0 0.0 6.4 5.6

1 2 20 23 32 36 41 49 29.0 25.5

(h) new phylum scenario sum true (bp)

sum false (bp)

overall prec.

description

338839

19454

94.6

root+superkingdom

0

31673

0.0

phylum+class+order

0

1653

0.0

family+genus+species

70.2 80.2

all but unassigned all with unassigned

taxator-tk binning of simulated metagenome with 49 species

false (bp) true (bp) macro precision α=0.99 macro recall

% macro-precision and macro-recall

assigned sequences in bp

% macro-precision and macro-recall

80

class

100.0 48.7 7.2 2.9 0.0 0.0 0.0 0.0 8.4 19.8

overall prec.

taxonomic rank

90

phylum

1 1 6 16 38 70 103 83 45.3 39.8

sum false (bp)

70

taxator-tk binning of simulated metagenome with 49 species

unassigned superkingdom

0.0 0.0 25.5 40.1 0.0 0.0 0.0 0.0 9.4 8.2

sum true (bp)

80

0

100

0

real bins

100 90 80

assigned sequences in bp

rank

stdev

90

species

Supplementary Figure S13 - Taxator-tk binning of simulated metagenome with 49 species (simArt49e)

unassigned superkingdom phylum class order family genus species avg/sum avg/sum

macro recall

100

taxonomic rank

macro precision α=0.99 100.0 91.8 51.5 0.0 0.0 0.0 0.0 0.0 20.5 30.4

pred. bins

assigned sequences in bp

70

% macro-precision and macro-recall

90 80

stdev

taxator-tk binning of simulated metagenome with 49 species

100

assigned sequences in bp

% macro-precision and macro-recall

taxator-tk binning of simulated metagenome with 49 species

macro unknown precision (bp) α=0.99 0 100.0 0 92.7 0 73.6 0 36.7 0 0.0 0 0.0 0 0.0 0 0.0 0 29.0 0 37.9

(f) new order scenario

70 60 50 40 30 20 10 0

unassigned superkingdom

taxonomic rank

phylum

class

order

taxonomic rank

family

genus

species

false (bp) true (bp) macro precision α=0.99 macro recall

Supplementary Figure S14: Bin precision plots for 49 species simulated metagenomic sample (simArt49e). Comparison of the assignment quality of CARMA3, MEGAN4 and taxator-tk for a simulated metagenome sample from a 49-species microbial community. Values are shown for the summary scenario (sum of all seven cross-validation scenarios), for assignments to the (a) species, (b) genus, (c) family, (d) order, (e) class and (f) phylum ranks, respectively. The first panel for each rank shows the precision and size of every predicted bin (after removing low-abundance bins). The colored line shows a smoothed k-nearest-neighbor estimate of the mean precision as a function of predicted bin size, computed with the R function wapply (width=0.3) followed by smooth.spline (df=10). The second panel for each rank shows bin precision relative to recall. The F-score partitioning helps to identify bins of similar quality if precision and recall are weighted equally; however, we consider precision more important than recall. The third panel shows the total true (blue), false (red) and unassigned (gray) portions of the assignments at the respective ranks. Note that a partially incorrect assignment is counted as incorrect at the low ranks where it is wrong and as correct at the higher ranks.
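
The F-score used to partition the precision/recall panels weights precision and recall equally. As a minimal sketch, assuming the standard F-beta definition, the Python function below computes it for a single bin; `f_score` and the example values are hypothetical illustrations, and the `beta=0.5` call only shows how a precision-favoring weighting would shift the score, not a setting taken from this study.

```python
def f_score(precision, recall, beta=1.0):
    """F-beta score for one bin; beta=1 weights precision and recall equally.

    precision and recall are fractions in [0, 1]. beta < 1 emphasizes
    precision, beta > 1 emphasizes recall.
    """
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1.0 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical bin: high precision, moderate recall.
f1 = f_score(0.9, 0.5)                  # equal weighting (F1)
f_half = f_score(0.9, 0.5, beta=0.5)    # precision weighted more strongly

print(f1, f_half)
```

With equal weighting, a bin with high precision and low recall scores the same as one with those two values swapped, which is why the caption stresses that precision is considered the more important of the two.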

Supplementary Figure S15: Taxonomic composition of SimMC/AMD

Taxonomic composition of the FAMeS simulated metagenome sample SimMC/AMD, visualized using Krona (Ondov et al., 2011). An interactive version can be found in the supplementary files (SimMC.krona.html). Abundance is measured as accumulated contig length.

Supplementary Figure S16: Taxonomic composition of SimHC/soil

Taxonomic composition of the FAMeS simulated metagenome sample SimHC/soil, visualized using Krona (Ondov et al., 2011). An interactive version can be found in the supplementary files (SimHC.krona.html). Abundance is measured as accumulated contig length.

Supplementary Figure S17 - MEGAN binning for FAMeS SimMC

rank                           depth  true (kb)  false (kb)  unknown (kb)  macro precision α=0.99  stdev  pred. bins  macro recall  stdev  real bins
unassigned                     0      877.9      0.0         0             100.0                   0.0    1           100.0         0.0    1
superkingdom                   1      2428.7     7.5         0             100.0                   0.0    1           45.0          45.0   2
phylum                         2      2508.3     60.0        0             18.7                    32.6   8           35.4          23.9   8
class                          3      1611.6     389.1       0             14.8                    29.4   17          24.3          19.7   12
order                          4      484.1      646.1       0             9.8                     23.8   39          15.6          16.5   23
family                         5      1590.7     617.3       0             6.1                     21.5   69          7.5           13.0   30
genus                          6      811.4      1102.6      0             3.9                     18.0   131         3.5           7.2    37
species                        7      2332.3     1572.8      0             3.0                     16.5   188         1.8           4.7    47
avg/sum (all but unassigned)   3.3    11767.1    4395.3      0             22.3                    20.2   64.7        19.0          18.6   22.7
avg/sum (all with unassigned)  3.1    12645.0    4395.3      0             32.0                    17.7   56.8        29.1          16.2   20.0

description           sum true (kb)  sum false (kb)  overall prec.
root+superkingdom     5735.3         7.5             99.9
phylum+class+order    4604.0         1095.1          80.8
family+genus+species  4734.4         3292.7          59.0
all but unassigned                                   72.8
all with unassigned                                  74.2

rank unassigned superkingdom phylum class order family genus species avg/sum avg/sum

depth

true (kb)

false (kb)

0 1 2 3 4 5 6 7 5.6 5.6

2.03 9.44 21.2 34.76 16.19 28.28 602.48 16325.9 17038.25 17040.28

0 0 0 0 0 0 0 0 0 0

[Plot: taxonomic rank on the x-axis; y-axes: assigned sequences in kb (0 to 16,000) and % macro-precision and macro-recall; legend: false (kb), true (kb), macro precision α=0.99, macro recall.]

rank

depth

true (kb)

false (kb)

unknown (kb)

0 1 2 3 4 5 6 7 5.0 4.9

234.56 273.59 1162.26 683.93 576.89 4640.58 5077.13 0 12414.38 12648.94

0 0 2.62 59.82 63.11 256.49 1966.57 2042.71 4391.32 4391.32

0 0 0 0 0 0 0 0 0 0

(c) new species scenario stdev

pred. bins

macro recall

stdev

real bins

0.0 0.0 0.0 44.7 40.3 38.4 28.8 0.0 21.7 19.0

1 1 1 3 10 18 24 32 12.7 11.3

100.0 48.4 55.4 45.6 36.2 21.3 6.1 0.0 30.4 39.1

0.0 48.4 44.1 37.9 39.5 36.6 18.7 0.0 32.2 28.2

1 2 8 12 23 30 37 47 22.7 20.0

sum true (kb)

sum false (kb)

overall prec.

description

781.74

0

100.0

root+superkingdom

2423.08

125.55

95.1

phylum+class+order

9717.71

4265.77

69.5

family+genus+species

73.9 74.2

all but unassigned all with unassigned

100.0 50.0 49.3 49.2 33.9 19.7 18.7 12.5 33.3 41.7

0.0 50.0 49.3 49.2 46.5 39.5 38.7 32.6 43.7 38.2

1 2 8 12 23 30 37 47 22.7 20.0

overall prec.

description

20.91

0

100.0

root+superkingdom

72.15

0

100.0

phylum+class+order

16956.66

0

100.0

family+genus+species

100.0 100.0

all but unassigned all with unassigned

[Plot: taxonomic rank (unassigned, superkingdom, phylum, class, order, family, genus, species) on the x-axis; legend: false (kb), true (kb), macro precision α=0.99, macro recall.]

Supplementary Figure S17 - MEGAN binning for FAMeS SimMC rank unassigned superkingdom phylum class order family genus species avg/sum avg/sum

[Plot: taxonomic rank on the x-axis; y-axes: assigned sequences in kb and % macro-precision and macro-recall; legend: false (kb), true (kb), macro precision α=0.99, macro recall.]

1 1 1 2 3 3 4 4 2.6 2.4

sum false (kb)

depth

true (kb)

false (kb)

unknown (kb)

0 1 2 3 4 5 6 7 4.3 4.2

358.62 526.48 1889.35 1360.44 1314.65 6466.24 0 0 11557.16 11915.78

0 0 2.62 89.37 128.45 303.88 2126.87 2473.3 5124.49 5124.49

0 0 0 0 0 0 0 0 0 0

macro precision α=0.99 100.0 100.0 100.0 65.6 31.8 13.9 0.0 0.0 44.5 51.4

(d) new genus scenario stdev

pred. bins

macro recall

stdev

real bins

0.0 0.0 0.0 45.6 40.5 32.2 0.0 0.0 16.9 14.8

1 1 1 3 11 17 39 45 16.7 14.8

100.0 47.7 54.3 40.3 26.2 11.2 0.0 0.0 25.7 34.9

0.0 47.7 43.3 35.8 31.9 28.1 0.0 0.0 26.7 23.3

1 2 8 12 23 30 37 47 22.7 20.0

sum true (kb)

sum false (kb)

overall prec.

description

1411.58

0

100.0

root+superkingdom

4564.44

220.44

95.4

phylum+class+order

6466.24

4904.05

56.9

family+genus+species

69.3 69.9

all but unassigned all with unassigned

MEGAN binning for FAMeS SimMC 16,000

class

0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

sum true (kb)

taxonomic rank

90

phylum

real bins

16,000

MEGAN binning for FAMeS SimMC

unassigned superkingdom

stdev

80

0

100

0

macro recall

90

0

Supplementary Figure S17 - MEGAN binning for FAMeS SimMC

unassigned superkingdom phylum class order family genus species avg/sum avg/sum

pred. bins

100

taxonomic rank

macro precision α=0.99 100.0 100.0 100.0 66.8 36.1 21.0 11.0 0.0 47.8 54.4

stdev

MEGAN binning for FAMeS SimMC

MEGAN binning for FAMeS SimMC 100

0

macro unknown (kb) precision α=0.99 0 100.0 0 100.0 0 100.0 0 100.0 0 100.0 0 100.0 0 100.0 0 100.0 0 100.0 0 100.0

(b) all reference scenario

assigned sequences in kb

unassigned superkingdom phylum class order family genus species avg/sum avg/sum

depth

Supplementary Figure S17 - MEGAN binning for FAMeS SimMC

[Plot: taxonomic rank (unassigned, superkingdom, phylum, class, order, family, genus, species) on the x-axis; y-axis: assigned sequences in kb (0 to 16,000); legend: false (kb), true (kb), macro precision α=0.99, macro recall.]

(a) summary scenario

Supplementary Figure S17 - MEGAN binning for FAMeS SimMC true (kb)

false (kb)

0 1 2 3 4 5 6 7 2.9 2.8

663.34 1775.34 4398.98 4868.55 1480.8 0 0 0 12523.67 13187.01

0 0 13.81 130.75 325.61 1031.59 839.84 1511.68 3853.28 3853.28

macro unknown (kb) precision α=0.99 0 100.0 0 100.0 0 52.5 0 38.0 0 9.8 0 0.0 0 0.0 0 0.0 0 28.6 0 37.5

stdev

pred. bins

macro recall

stdev

real bins

0.0 0.0 47.5 42.5 23.2 0.0 0.0 0.0 16.2 14.1

1 1 2 5 18 28 47 47 21.1 18.6

100.0 45.8 25.8 23.4 12.7 0.0 0.0 0.0 15.4 26.0

0.0 45.8 33.8 24.1 21.0 0.0 0.0 0.0 17.8 15.6

1 2 8 12 23 30 37 47 22.7 20.0

sum true (kb)

sum false (kb)

overall prec.

description

4214.02

0

100.0

root+superkingdom

10748.33

470.17

95.8

phylum+class+order

0

3383.11

0.0

family+genus+species

76.5 77.4

all but unassigned all with unassigned

rank unassigned superkingdom phylum class order family genus species avg/sum avg/sum

depth

true (kb)

false (kb)

0 1 2 3 4 5 6 7 2.7 2.5

767.74 2271.96 5432.33 4333.64 0 0 0 0 12037.93 12805.67

0 0 39.29 193 561.43 1006.4 817.67 1616.82 4234.61 4234.61

[Plot: taxonomic rank (unassigned, superkingdom, phylum, class, order, family, genus, species) on the x-axis; y-axes: assigned sequences in kb (0 to 16,000) and % macro-precision and macro-recall; legend: false (kb), true (kb), macro precision α=0.99, macro recall.]

rank

depth

true (kb)

false (kb)

unknown (kb)

0 1 2 3 4 5 6 7 2.5 2.3

1274.66 4959.38 4654.01 0 0 0 0 0 9613.39 10888.05

0 0 87.88 1661.84 1445.26 960.19 1106.84 890.22 6152.23 6152.23

0 0 0 0 0 0 0 0 0 0

(g) new class scenario stdev

pred. bins

macro recall

stdev

real bins

0.0 0.0 34.7 0.0 0.0 0.0 0.0 0.0 5.0 4.3

1 1 7 12 31 40 57 46 27.7 24.4

100.0 42.4 21.4 0.0 0.0 0.0 0.0 0.0 9.1 20.5

0.0 42.4 34.5 0.0 0.0 0.0 0.0 0.0 11.0 9.6

1 2 8 12 23 30 37 47 22.7 20.0

sum true (kb)

sum false (kb)

overall prec.

description

11193.42

0

100.0

root+superkingdom

4654.01

3194.98

59.3

phylum+class+order

0

2957.25

0.0

family+genus+species

61.0 63.9

all but unassigned all with unassigned

100.0 45.1 41.6 11.5 0.0 0.0 0.0 0.0 14.0 24.8

0.0 45.1 40.0 13.9 0.0 0.0 0.0 0.0 14.1 12.4

1 2 8 12 23 30 37 47 22.7 20.0

overall prec.

description

5311.66

0

100.0

root+superkingdom

9765.97

793.72

92.5

phylum+class+order

0

3440.89

0.0

family+genus+species

74.0 75.1

all but unassigned all with unassigned

[Plot: taxonomic rank (unassigned, superkingdom, phylum, class, order, family, genus, species) on the x-axis; legend: false (kb), true (kb), macro precision α=0.99, macro recall.]

Supplementary Figure S17 - MEGAN binning for FAMeS SimMC rank unassigned superkingdom phylum class order family genus species avg/sum avg/sum

[Plot: taxonomic rank on the x-axis; y-axes: assigned sequences in kb and % macro-precision and macro-recall; legend: false (kb), true (kb), macro precision α=0.99, macro recall.]

1 1 3 7 21 30 43 39 20.6 18.1

sum false (kb)

depth

true (kb)

false (kb)

unknown (kb)

0 1 2 3 4 5 6 7 2.3 1.8

2844.15 7184.81 0 0 0 0 0 0 7184.81 10028.96

0 52.43 273.68 588.78 1998.56 762.4 860.34 2475.13 7011.32 7011.32

0 0 0 0 0 0 0 0 0 0

macro precision α=0.99 100.0 100.0 0.0 0.0 0.0 0.0 0.0 0.0 14.3 25.0

(h) new phylum scenario stdev

pred. bins

macro recall

stdev

real bins

0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

1 1 14 26 40 61 72 69 40.4 35.5

100.0 35.3 0.0 0.0 0.0 0.0 0.0 0.0 5.0 16.9

0.0 35.3 0.0 0.0 0.0 0.0 0.0 0.0 5.0 4.4

1 2 8 12 23 30 37 47 22.7 20.0

sum true (kb)

sum false (kb)

overall prec.

description

17213.77

52.43

99.7

root+superkingdom

0

2861.02

0.0

phylum+class+order

0

4097.87

0.0

family+genus+species

50.6 58.9

all but unassigned all with unassigned

MEGAN binning for FAMeS SimMC 16,000

class

0.0 0.0 43.7 36.9 0.0 0.0 0.0 0.0 11.5 10.1

sum true (kb)

taxonomic rank

90

phylum

real bins

16,000

MEGAN binning for FAMeS SimMC

unassigned superkingdom

stdev

80

0

100

0

macro recall

90

0

Supplementary Figure S17 - MEGAN binning for FAMeS SimMC

unassigned superkingdom phylum class order family genus species avg/sum avg/sum

pred. bins

100

taxonomic rank

macro precision α=0.99 100.0 100.0 15.0 0.0 0.0 0.0 0.0 0.0 16.4 26.9

stdev

MEGAN binning for FAMeS SimMC

100

assigned sequences in kb

% macro-precision and macro-recall

MEGAN binning for FAMeS SimMC

macro unknown (kb) precision α=0.99 0 100.0 0 100.0 0 38.6 0 25.7 0 0.0 0 0.0 0 0.0 0 0.0 0 23.5 0 33.0

(f) new order scenario

assigned sequences in kb

unassigned superkingdom phylum class order family genus species avg/sum avg/sum

depth

Supplementary Figure S17 - MEGAN binning for FAMeS SimMC

[Plot: taxonomic rank (unassigned, superkingdom, phylum, class, order, family, genus, species) on the x-axis; y-axis: assigned sequences in kb (0 to 16,000); legend: false (kb), true (kb), macro precision α=0.99, macro recall.]

(e) new family scenario

Supplementary Figure S18 - Taxator-tk binning for FAMeS SimMC

rank                           depth  true (kb)  false (kb)  unknown (kb)  macro precision α=0.99  stdev  pred. bins  macro recall  stdev  real bins
unassigned                     0      2083.8     0.0         0             100.0                   0.0    1           100.0         0.0    1
superkingdom                   1      4704.3     0.6         0             100.0                   0.0    1           62.5          19.6   2
phylum                         2      3460.1     26.9        0             52.0                    36.2   4           38.1          13.3   8
class                          3      1860.6     182.2       0             49.0                    47.6   4           24.0          14.2   12
order                          4      560.8      257.2       0             40.7                    41.6   15          18.6          12.0   23
family                         5      1573.3     89.4        0             22.8                    38.3   19          12.8          10.8   30
genus                          6      1012.7     196.9       0             37.5                    45.7   19          8.0           7.9    37
species                        7      978.0      53.3        0             39.2                    48.4   54          5.0           6.3    47
avg/sum (all but unassigned)   2.6    14149.8    806.6       0             48.7                    36.8   16.6        24.1          12.0   22.7
avg/sum (all with unassigned)  2.3    16233.6    806.6       0             55.2                    32.2   14.6        33.6          10.5   20.0

description           sum true (kb)  sum false (kb)  overall prec.
root+superkingdom     11492.5        0.6             100.0
phylum+class+order    5881.5         466.4           92.7
family+genus+species  3564.0         339.7           91.3
all but unassigned                                   94.6
all with unassigned                                  95.3

rank unassigned superkingdom phylum class order family genus species avg/sum avg/sum

depth

true (kb)

false (kb)

0 1 2 3 4 5 6 7 4.0 3.9

251.07 1303.39 1673.66 1129.62 647.31 1728.38 3460.93 6845.91 16789.2 17040.27

0 0 0 0 0 0 0 0 0 0

[Plot: taxonomic rank on the x-axis; y-axes: assigned sequences in kb (0 to 16,000) and % macro-precision and macro-recall; legend: false (kb), true (kb), macro precision α=0.99, macro recall.]

rank

depth

true (kb)

false (kb)

unknown (kb)

0 1 2 3 4 5 6 7 4.0 3.6

1558.82 1613.56 2761.05 1806.58 1024.51 3915.66 3628.14 0 14749.5 16308.32

0 0 2.62 27.87 19.75 33.96 630.72 17.05 731.97 731.97

0 0 0 0 0 0 0 0 0 0

(c) new species scenario stdev

pred. bins

macro recall

stdev

real bins

0.0 0.0 0.0 42.1 35.6 36.6 44.8 0.0 22.7 19.9

1 1 1 3 7 6 3 11 4.6 4.1

100.0 95.8 69.6 39.9 32.4 22.0 8.8 0.0 38.4 46.1

0.0 4.2 30.2 35.3 34.2 34.7 24.1 0.0 23.2 20.3

1 2 8 12 23 30 37 47 22.7 20.0

sum true (kb)

sum false (kb)

overall prec.

description

4785.94

0

100.0

root+superkingdom

5592.14

50.24

99.1

phylum+class+order

7543.8

681.73

91.7

family+genus+species

95.3 95.7

all but unassigned all with unassigned

100.0 98.8 85.8 73.6 68.7 59.0 47.1 34.9 66.8 71.0

0.0 1.2 15.8 22.1 29.4 42.0 43.6 44.2 28.3 24.8

1 2 8 12 23 30 37 47 22.7 20.0

overall prec.

description

2857.85

0

100.0

root+superkingdom

3450.59

0

100.0

phylum+class+order

12035.22

0

100.0

family+genus+species

100.0 100.0

all but unassigned all with unassigned

[Plot: taxonomic rank (unassigned, superkingdom, phylum, class, order, family, genus, species) on the x-axis; legend: false (kb), true (kb), macro precision α=0.99, macro recall.]

Supplementary Figure S18 - Taxator-tk binning for FAMeS SimMC rank unassigned superkingdom phylum class order family genus species avg/sum avg/sum

[Plot: taxonomic rank on the x-axis; y-axes: assigned sequences in kb and % macro-precision and macro-recall; legend: false (kb), true (kb), macro precision α=0.99, macro recall.]

1 1 1 2 4 3 5 5 3.0 2.8

sum false (kb)

depth

true (kb)

false (kb)

unknown (kb)

0 1 2 3 4 5 6 7 3.4 3.1

1745.53 1976.55 3398.87 2312.46 1189.02 5368.74 0 0 14245.64 15991.17

0 0 2.62 32.37 40.59 54.74 636.48 282.3 1049.1 1049.1

0 0 0 0 0 0 0 0 0 0

macro precision α=0.99 100.0 100.0 100.0 66.9 55.5 50.0 0.0 0.0 53.2 59.0

(d) new genus scenario stdev

pred. bins

macro recall

stdev

real bins

0.0 0.0 0.0 44.9 42.8 46.3 0.0 0.0 19.2 16.8

1 1 1 3 7 7 15 9 6.1 5.5

100.0 94.6 63.2 30.2 22.2 8.3 0.0 0.0 31.2 39.8

0.0 5.4 32.7 32.2 32.0 22.3 0.0 0.0 17.8 15.6

1 2 8 12 23 30 37 47 22.7 20.0

sum true (kb)

sum false (kb)

overall prec.

description

5698.63

0

100.0

root+superkingdom

6900.35

75.58

98.9

phylum+class+order

5368.74

973.52

84.7

family+genus+species

93.1 93.8

all but unassigned all with unassigned

taxator-tk binning for FAMeS SimMC 16,000

class

0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

sum true (kb)

taxonomic rank

90

phylum

real bins

16,000

taxator-tk binning for FAMeS SimMC

unassigned superkingdom

stdev

80

0

100

0

macro recall

90

0

Supplementary Figure S18 - Taxator-tk binning for FAMeS SimMC

unassigned superkingdom phylum class order family genus species avg/sum avg/sum

pred. bins

100

taxonomic rank

macro precision α=0.99 100.0 100.0 100.0 69.1 75.1 81.0 63.1 0.0 69.7 73.5

stdev

taxator-tk binning for FAMeS SimMC

taxator-tk binning for FAMeS SimMC 100

0

macro unknown (kb) precision α=0.99 0 100.0 0 100.0 0 100.0 0 100.0 0 100.0 0 100.0 0 100.0 0 100.0 0 100.0 0 100.0

(b) all reference scenario

assigned sequences in kb

unassigned superkingdom phylum class order family genus species avg/sum avg/sum

depth

Supplementary Figure S18 - Taxator-tk binning for FAMeS SimMC

[Plot: taxonomic rank (unassigned, superkingdom, phylum, class, order, family, genus, species) on the x-axis; y-axis: assigned sequences in kb (0 to 16,000); legend: false (kb), true (kb), macro precision α=0.99, macro recall.]

(a) summary scenario

Supplementary Figure S18 - Taxator-tk binning for FAMeS SimMC true (kb)

false (kb)

0 1 2 3 4 5 6 7 2.2 2.0

1444.21 4049.76 5463.82 4607.11 1065.02 0 0 0 15185.71 16629.92

0 0 11.04 80.28 96.56 179.84 32.8 9.85 410.37 410.37

macro unknown (kb) precision α=0.99 0 100.0 0 100.0 0 100.0 0 61.6 0 20.6 0 0.0 0 0.0 0 0.0 0 40.3 0 47.8

stdev

pred. bins

macro recall

stdev

real bins

0.0 0.0 0.0 43.9 37.4 0.0 0.0 0.0 11.6 10.2

1 1 1 3 18 21 14 7 9.3 8.3

100.0 42.8 31.1 21.0 6.8 0.0 0.0 0.0 14.5 25.2

0.0 42.8 36.5 31.0 17.0 0.0 0.0 0.0 18.2 15.9

1 2 8 12 23 30 37 47 22.7 20.0

sum true (kb)

sum false (kb)

overall prec.

description

9543.73

0

100.0

root+superkingdom

11135.95

187.88

98.3

phylum+class+order

0

222.49

0.0

family+genus+species

97.4 97.6

all but unassigned all with unassigned

rank unassigned superkingdom phylum class order family genus species avg/sum avg/sum

depth

true (kb)

false (kb)

0 1 2 3 4 5 6 7 2.0 1.8

1665.24 5067.52 6525.5 3168.64 0 0 0 0 14761.66 16426.9

0 0 14.12 86.66 288.11 169.88 40.01 14.6 613.38 613.38

[Plot: taxonomic rank (unassigned, superkingdom, phylum, class, order, family, genus, species) on the x-axis; y-axes: assigned sequences in kb (0 to 16,000) and % macro-precision and macro-recall; legend: false (kb), true (kb), macro precision α=0.99, macro recall.]

rank

depth

true (kb)

false (kb)

unknown (kb)

0 1 2 3 4 5 6 7 1.6 1.3

2853.36 8356.71 4397.58 0 0 0 0 0 12754.29 15607.65

0 0 25.84 659.76 597.06 108.29 23.82 17.85 1432.62 1432.62

0 0 0 0 0 0 0 0 0 0

(g) new class scenario stdev

pred. bins

macro recall

stdev

real bins

0.0 0.0 47.1 0.0 0.0 0.0 0.0 0.0 6.7 5.9

1 1 3 11 18 21 14 9 11.0 9.8

100.0 36.9 3.2 0.0 0.0 0.0 0.0 0.0 5.7 17.5

0.0 36.9 8.5 0.0 0.0 0.0 0.0 0.0 6.5 5.7

1 2 8 12 23 30 37 47 22.7 20.0

sum true (kb)

sum false (kb)

overall prec.

description

19566.78

0

100.0

root+superkingdom

4397.58

1282.66

77.4

phylum+class+order

0

149.96

0.0

family+genus+species

89.9 91.6

all but unassigned all with unassigned

100.0 41.4 13.5 3.3 0.0 0.0 0.0 0.0 8.3 19.8

0.0 41.4 20.3 6.6 0.0 0.0 0.0 0.0 9.8 8.5

1 2 8 12 23 30 37 47 22.7 20.0

overall prec.

description

11800.28

0

100.0

root+superkingdom

9694.14

388.89

96.1

phylum+class+order

0

224.49

0.0

family+genus+species

96.0 96.4

all but unassigned all with unassigned

[Plot: taxonomic rank (unassigned, superkingdom, phylum, class, order, family, genus, species) on the x-axis; legend: false (kb), true (kb), macro precision α=0.99, macro recall.]

Supplementary Figure S18 - Taxator-tk binning for FAMeS SimMC rank unassigned superkingdom phylum class order family genus species avg/sum avg/sum

[Plot: taxonomic rank on the x-axis; y-axes: assigned sequences in kb and % macro-precision and macro-recall; legend: false (kb), true (kb), macro precision α=0.99, macro recall.]

1 1 1 4 19 17 14 9 9.3 8.3

sum false (kb)

depth

true (kb)

false (kb)

unknown (kb)

0 1 2 3 4 5 6 7 1.3 0.9

5068.27 10562.94 0 0 0 0 0 0 10562.94 15631.21

0 4.05 132.26 388.44 758.65 79.35 14.55 31.77 1409.07 1409.07

0 0 0 0 0 0 0 0 0 0

macro precision α=0.99 100.0 100.0 0.0 0.0 0.0 0.0 0.0 0.0 14.3 25.0

(h) new phylum scenario stdev

pred. bins

macro recall

stdev

real bins

0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

1 1 10 18 19 24 17 14 14.7 13.0

100.0 27.3 0.0 0.0 0.0 0.0 0.0 0.0 3.9 15.9

0.0 27.3 0.0 0.0 0.0 0.0 0.0 0.0 3.9 3.4

1 2 8 12 23 30 37 47 22.7 20.0

sum true (kb)

sum false (kb)

overall prec.

description

26194.15

4.05

100.0

root+superkingdom

0

1279.35

0.0

phylum+class+order

0

125.67

0.0

family+genus+species

88.2 91.7

all but unassigned all with unassigned

taxator-tk binning for FAMeS SimMC 16,000

class

0.0 0.0 0.0 45.6 0.0 0.0 0.0 0.0 6.5 5.7

sum true (kb)

taxonomic rank

90

phylum

real bins

16,000

taxator-tk binning for FAMeS SimMC

unassigned superkingdom

stdev

80

0

100

0

macro recall

90

0

Supplementary Figure S18 - Taxator-tk binning for FAMeS SimMC

unassigned superkingdom phylum class order family genus species avg/sum avg/sum

pred. bins

100

taxonomic rank

macro precision α=0.99 100.0 100.0 33.3 0.0 0.0 0.0 0.0 0.0 19.0 29.2

stdev

taxator-tk binning for FAMeS SimMC

100

assigned sequences in kb

% macro-precision and macro-recall

taxator-tk binning for FAMeS SimMC

macro unknown (kb) precision α=0.99 0 100.0 0 100.0 0 100.0 0 45.2 0 0.0 0 0.0 0 0.0 0 0.0 0 35.0 0 43.1

(f) new order scenario

assigned sequences in kb

unassigned superkingdom phylum class order family genus species avg/sum avg/sum

depth

Supplementary Figure S18 - Taxator-tk binning for FAMeS SimMC

[Plot: taxonomic rank (unassigned, superkingdom, phylum, class, order, family, genus, species) on the x-axis; y-axis: assigned sequences in kb (0 to 16,000); legend: false (kb), true (kb), macro precision α=0.99, macro recall.]

(e) new family scenario

Supplementary Figure S19 - MEGAN binning for FAMeS SimHC

(a) summary scenario

rank                    unassigned  superkingdom  phylum    class     order    family   genus    species   avg/sum   avg/sum
depth                   0           1             2         3         4        5        6        7         3.1       2.7
true (bp)               135097.7    184917.1      126186.3  105637.4  65414.1  47775.7  70368.1  110011.3  710310.2  845407.9
false (bp)              0.0         1240.4        17996.0   52554.0   53941.1  34408.6  42132.0  58509.7   260781.8  260781.8
unknown (bp)            0.0         0.0           0.0       2704.9    0.0      0.0      382.3    356.5     3443.6    3443.6
macro precision α=0.99  100.0       99.5          54.3      61.0      60.5     68.1     65.3     69.9      68.4      72.3
stdev                   0.0         0.0           45.2      36.9      41.5     38.6     43.0     45.6      35.8      31.3
pred. bins              1           1             10        12        27       36       47       47        25.7      22.6
macro recall            100.0       65.7          39.8      32.6      17.4     13.1     8.4      4.7       25.9      35.2
stdev                   0.0         21.3          24.2      19.7      17.2     13.3     9.4      6.5       15.9      14.0
real bins               1           2             8         12        36       52       72       96        39.7      34.9

description           sum true (bp)  sum false (bp)  overall prec
root+superkingdom     504932.0       1240.4          99.8
phylum+class+order    297237.9       124491.1        70.5
family+genus+species  228155.2       135050.2        62.8
all but unassigned                                   73.1
all with unassigned                                  76.4
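The "overall prec" values in this summary appear to be pooled precisions, i.e. the correctly assigned base pairs divided by all assigned base pairs within a group of ranks. The following minimal sketch checks that reading against the numbers listed above for the MEGAN summary scenario. The function name `pooled_precision` and the handling of an empty group (returning 0.0) are assumptions made for illustration; they are not taken from the taxator-tk evaluation code.

```python
def pooled_precision(true_amount: float, false_amount: float) -> float:
    """Percentage of assigned sequence (bp or kb) that was assigned correctly.

    Assumption: a group with nothing assigned is reported as 0.0.
    """
    assigned = true_amount + false_amount
    if assigned == 0:
        return 0.0
    return 100.0 * true_amount / assigned


# (sum true, sum false) in bp, copied from the summary rows above.
groups = {
    "root+superkingdom": (504932.0, 1240.4),
    "phylum+class+order": (297237.9, 124491.1),
    "family+genus+species": (228155.2, 135050.2),
    # avg/sum columns of the true (bp) and false (bp) rows
    "all but unassigned": (710310.2, 260781.8),
    "all with unassigned": (845407.9, 260781.8),
}

overall = {name: round(pooled_precision(t, f), 1) for name, (t, f) in groups.items()}

for name, value in overall.items():
    print(f"{name:22s} {value:5.1f}")
```

Under this reading the rounded results agree with the tabulated 99.8, 70.5, 62.8, 73.1 and 76.4. Note that pooled precision weights every base pair equally, whereas the macro precision row averages over predicted bins, which is why the two can differ considerably at the lower ranks.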

rank unassigned superkingdom phylum class order family genus species avg/sum avg/sum

depth

true (bp)

false (bp)

0 1 2 3 4 5 6 7 4.8 4.8

0 14830 27071 58344 38434 58139 223807 660068 1080693 1080693

0 0 0 0 0 0 0 0 0 0


rank

depth

true (bp)

false (bp)

unknown (bp)

0 1 2 3 4 5 6 7 4.4 4.1

53833 71113 85694 111601 113724 140308 268770 0 791210 845043

0 0 0 14697 10562 25633 68288 121285 240465 240465

0 0 0 0 0 0 0 0 0 0

(c) new species scenario stdev

pred. bins

macro recall

stdev

real bins

0.0 0.1 0.5 4.6 29.7 27.0 44.8 0.0 15.2 13.3

1 2 6 9 18 20 17 8 11.4 10.1

100.0 86.2 65.7 57.5 33.0 24.5 11.9 0.0 39.8 47.3

0.0 8.4 38.2 34.8 38.4 35.6 29.1 0.0 26.4 23.1

1 2 8 12 36 52 72 96 39.7 34.9

sum true (bp)

sum false (bp)

overall prec

description

196059

0

100.0

root+superkingdom

311019

25259

92.5

phylum+class+order

409078

215206

65.5

family+genus+species

76.7 77.8

all but unassigned all with unassigned

100.0 100.0 74.9 74.0 55.0 54.8 46.7 32.8 62.6 67.3

0.0 0.0 43.2 42.8 49.3 49.1 49.5 45.8 39.9 35.0

1 2 8 12 36 52 72 96 39.7 34.9

overall prec

description

29660

0

100.0

root+superkingdom

123849

0

100.0

phylum+class+order

942014

0

100.0

family+genus+species

100.0 100.0

all but unassigned all with unassigned


Supplementary Figure S19 - MEGAN binning for FAMeS SimHC rank unassigned superkingdom phylum class order family genus species avg/sum avg/sum


1 2 6 9 20 29 34 33 19.0 16.8

sum false (bp)

depth

true (bp)

false (bp)

unknown (bp)

0 1 2 3 4 5 6 7 3.6 3.4

60008 103063 142075 171151 168553 135983 0 0 720825 780833

0 0 0 27602 29605 53126 123176 71166 304675 304675

0 0 0 0 0 0 0 0 0 0

macro precision α=0.99 100.0 99.9 98.3 92.0 73.6 55.4 0.0 0.0 59.9 64.9

(d) new genus scenario stdev

pred. bins

macro recall

stdev

real bins

0.0 0.1 2.3 6.7 34.2 39.5 0.0 0.0 11.8 10.4

1 2 6 9 18 16 12 5 9.7 8.6

100.0 85.7 62.6 49.7 25.2 12.0 0.0 0.0 33.6 41.9

0.0 8.0 36.7 31.2 32.4 25.2 0.0 0.0 19.1 16.7

1 2 8 12 36 52 72 96 39.7 34.9

sum true (bp)

sum false (bp)

overall prec

description

266134

0

100.0

root+superkingdom

481779

57207

89.4

phylum+class+order

135983

247468

35.5

family+genus+species

70.3 71.9

all but unassigned all with unassigned

MEGAN binning for FAMeS SimHC 600,000

80

class

0.0 0.0 1.0 0.8 0.7 3.1 2.8 2.9 1.6 1.4

sum true (bp)

taxonomic rank

90

phylum

real bins

600,000

MEGAN binning for FAMeS SimHC

unassigned superkingdom

stdev

80

0

100

0

macro recall

90

0

Supplementary Figure S19 - MEGAN binning for FAMeS SimHC

unassigned superkingdom phylum class order family genus species avg/sum avg/sum

pred. bins

100

taxonomic rank

macro precision α=0.99 100.0 99.9 99.7 96.4 86.9 83.4 56.8 0.0 74.7 77.9

stdev

MEGAN binning for FAMeS SimHC

MEGAN binning for FAMeS SimHC 100

0

macro unknown precision (bp) α=0.99 0 100.0 0 100.0 0 99.6 0 99.6 0 99.8 0 99.4 2676 99.5 2139 99.5 4815 99.6 4815 99.7

(b) all reference scenario

assigned sequences in bp

unassigned superkingdom phylum class order family genus species avg/sum avg/sum

depth

Supplementary Figure S19 - MEGAN binning for FAMeS SimHC

(a) summary scenario

[Plot: assigned sequences in bp (false, true) and % macro-precision (α=0.99) and macro-recall, by taxonomic rank (unassigned to species).]

Supplementary Figure S19 - MEGAN binning for FAMeS SimHC

(e) new family scenario

rank                    unassigned  superkingdom  phylum  class   order   family  genus  species  avg/sum  avg/sum
depth                   0           1             2       3       4       5       6      7        2.9      2.4
true (bp)               180596      158948        202392  200853  137188  0       0      0        699381   879977
false (bp)              0           1776          3264    52078   48101   57702   31996  10614    205531   205531
unknown (bp)            0           0             0       0       0       0       0      0        0        0
macro precision α=0.99  100.0       99.7          95.4    82.6    56.7    0.0     0.0    0.0      47.8     54.3
stdev                   0.0         0.3           4.8     9.8     33.0    0.0     0.0    0.0      6.8      6.0
pred. bins              1           2             5       7       11      8       3      1        5.3      4.8
macro recall            100.0       69.3          35.9    30.3    8.4     0.0     0.0    0.0      20.5     30.5
stdev                   0.0         13.7          33.9    27.9    17.2    0.0     0.0    0.0      13.2     11.6
real bins               1           2             8       12      36      52      72     96       39.7     34.9

description           sum true (bp)  sum false (bp)  overall prec
root+superkingdom     498492         1776            99.6
phylum+class+order    540433         103443          83.9
family+genus+species  0              100312          0.0
all but unassigned                                   77.3
all with unassigned                                  81.1

rank unassigned superkingdom phylum class order family genus species avg/sum avg/sum

depth

true (bp)

false (bp)

0 1 2 3 4 5 6 7 2.5 2.3

115421 221464 250454 197513 0 0 0 0 669431 784852

0 1776 8635 58615 102179 40254 35656 52151 299266 299266


rank

depth

true (bp)

false (bp)

unknown (bp)

0 1 2 3 4 5 6 7 2.2 1.6

271642 343666 175618 0 0 0 0

0 1776 16711 146037 72649 34445 22964

0 0 0 0 0 0 0

519284 790926

294582 294582

0 0

25.5 36.1

(g) new class scenario stdev

pred. bins

macro recall

stdev

real bins

0.0 0.0 38.2 0.0 0.0 0.0 0.0

1 1 3 7 9 5 2

6.4 5.5

4.5 4.0

100.0 37.7 10.0 0.0 0.0 0.0 0.0 0.0 6.8 18.5

0.0 37.7 18.1 0.0 0.0 0.0 0.0 0.0 8.0 7.0

1 2 8 12 36 52 72 96 39.7 34.9

sum true (bp)

sum false (bp)

overall prec

description

958974

1776

99.8

root+superkingdom

175618

235397

42.7

phylum+class+order

0

57409

0.0

family+genus+species

63.8 72.9

all but unassigned all with unassigned

100.0 44.3 29.2 16.7 0.0 0.0 0.0 0.0 12.9 23.8

0.0 44.3 34.1 18.1 0.0 0.0 0.0 0.0 13.8 12.1

1 2 8 12 36 52 72 96 39.7 34.9

overall prec

description

558349

1776

99.7

root+superkingdom

447967

169429

72.6

phylum+class+order

0

128061

0.0

family+genus+species

69.1 72.4

all but unassigned all with unassigned


Supplementary Figure S19 - MEGAN binning for FAMeS SimHC rank unassigned superkingdom phylum class order family genus species avg/sum avg/sum


1 1 4 7 12 6 5 2 5.3 4.8

sum false (bp)

depth

true (bp)

false (bp)

unknown (bp)

0 1 2 3 4 5 6 7 2.1 1.5

264184 381336 0 0 0 0 0 0 381336 645520

0 3355 97362 68849 114492 29700 12844 95842 422444 422444

0 0 0 17544 0 0 0 0 17544 17544

macro precision α=0.99 100.0 98.8 0.0 0.0 0.0 0.0 0.0 0.0 14.1 24.9

(h) new phylum scenario stdev

pred. bins

macro recall

stdev

real bins

0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

1 1 12 11 10 6 4 5 7.0 6.3

100.0 36.7 0.0 0.0 0.0 0.0 0.0 0.0 5.2 17.1

0.0 36.7 0.0 0.0 0.0 0.0 0.0 0.0 5.2 4.6

1 2 8 12 36 52 72 96 39.7 34.9

sum true (bp)

sum false (bp)

overall prec

description

1026856

3355

99.7

root+superkingdom

0

280703

0.0

phylum+class+order

0

138386

0.0

family+genus+species

47.4 60.4

all but unassigned all with unassigned

MEGAN binning for FAMeS SimHC 600,000

80

class

0.0 0.0 18.2 16.6 0.0 0.0 0.0 0.0 5.0 4.3

sum true (bp)

taxonomic rank

90

phylum

real bins

600,000

MEGAN binning for FAMeS SimHC

unassigned superkingdom

stdev

80

0

100

0

macro recall

90

0

Supplementary Figure S19 - MEGAN binning for FAMeS SimHC

unassigned superkingdom phylum class order family genus species avg/sum avg/sum

pred. bins

100

taxonomic rank

macro precision α=0.99 100.0 99.3 53.5 0.0 0.0 0.0 0.0

stdev

MEGAN binning for FAMeS SimHC

100

assigned sequences in bp

% macro-precision and macro-recall

MEGAN binning for FAMeS SimHC

macro unknown precision (bp) α=0.99 0 100.0 0 99.4 0 84.2 1390 62.1 0 0.0 0 0.0 0 0.0 0 0.0 1390 35.1 1390 43.2

(f) new order scenario

assigned sequences in bp

unassigned superkingdom phylum class order family genus species avg/sum avg/sum

depth

Supplementary Figure S19 - MEGAN binning for FAMeS SimHC

(e) new family scenario

[Plot: assigned sequences in bp (false, true) and % macro-precision (α=0.99) and macro-recall, by taxonomic rank (unassigned to species).]

Supplementary Figure S20 - Taxator-tk binning for FAMeS SimHC

(a) summary scenario

rank                    unassigned  superkingdom  phylum    class    order    family   genus    species  avg/sum   avg/sum
depth                   0           1             2         3        4        5        6        7        2.3       1.8
true (bp)               213863.4    389604.7      144608.1  74893.3  39907.3  31822.9  59831.9  61405.0  802073.1  1015936.6
false (bp)              0.0         1240.4        5934.9    12043.4  14116.0  11110.7  13687.3  9917.1   68049.9   68049.9
unknown (bp)            0.0         0.0           0.0       757.0    0.0      0.0      382.3    382.3    1521.6    1521.6
macro precision α=0.99  100.0       99.9          96.5      92.7     65.6     68.9     75.1     76.6     82.2      84.4
stdev                   0.0         0.1           5.1       8.4      44.8     43.0     40.3     41.5     26.2      22.9
pred. bins              1           2             7         11       47       58       68       66       37.0      32.5
macro recall            100.0       64.1          34.9      24.4     15.3     11.7     8.2      4.1      23.2      32.8
stdev                   0.0         14.9          7.9       10.3     10.0     9.7      7.5      4.8      9.3       8.1
real bins               1           2             8         12       36       52       72       96       39.7      34.9

description           sum true (bp)  sum false (bp)  overall prec.
root+superkingdom     993072.9       1240.4          99.9
phylum+class+order    259408.7       32094.3         89.0
family+genus+species  153059.7       34715.1         81.5
all but unassigned                                   92.2
all with unassigned                                  93.7

rank

depth 0 1 2 3 4 5 6 7 3.6 3.3

unassigned superkingdom phylum class order family genus species avg/sum avg/sum

true (bp) 47885 163775 70870 65908 48339 72448 179113 429835 1030288 1078173


0 0 0 0 0 2520 2520 2520

rank

depth 0 1 2 3 4 5 6 7 3.5 3.2

true (bp) 92565 257961 152494 94075 96579 80651 239710 0 921470 1014035

false (bp)

unknown (bp)

0 0 6930 10331 34526 19686 71473 71473

0 0 0 0 0 0 0 0 0 0

(c) new species scenario stdev

pred. bins

macro recall

stdev

real bins

0.0 0.0 0.0 0.0 28.8 30.4 46.1 0.0 15.0 13.2

1 2 7 10 26 30 31 9 16.4 14.5

100.0 84.1 62.6 48.6 30.0 23.7 13.9 0.0 37.6 45.4

0.0 6.4 26.6 28.5 30.1 31.8 26.1 0.0 21.3 18.7

1 2 8 12 36 52 72 96 39.7 34.9

sum true (bp)

sum false (bp)

608487

0

overall prec. 100.0

description root+superkingdom

343148

6930

98.0

phylum+class+order

320361

64543

83.2

family+genus+species

92.8 93.4

all but unassigned all with unassigned

1 2 6 9 32 43 55 52 28.4 25.0

100.0 97.6 70.3 59.3 55.5 49.9 43.1 28.5 57.8 63.0

0.0 2.4 28.5 28.9 30.6 34.0 35.7 33.5 27.7 24.2

1 2 8 12 36 52 72 96 39.7 34.9

sum false (bp)

overall prec.

description

375435

0

100.0

root+superkingdom

185117

0

100.0

phylum+class+order

681396

2520

99.6

family+genus+species

99.8 99.8

all but unassigned all with unassigned


Supplementary Figure S20 - Taxator-tk binning for FAMeS SimHC rank

depth 0 1 2 3 4 5 6 7 2.4 2.2

unassigned superkingdom phylum class order family genus species avg/sum avg/sum


0.0 0.0 1.2 1.1 0.4 0.4 0.3 0.3 0.5 0.5

sum true (bp)

true (bp) 121975 341016 232318 148392 88010 69661 0 0 879397 1001372

false (bp)

unknown (bp)

0 5857 11695 14936 39229 12419 84136 84136

0 0 0 0 0 0 0 0 0 0

macro precision α=0.99 100.0 100.0 100.0 98.6 77.2 58.7 0.0 0.0 62.1 66.8

(d) new genus scenario stdev

pred. bins

macro recall

stdev

real bins

0.0 0.0 0.0 2.6 38.5 46.8 0.0 0.0 12.6 11.0

1 2 7 11 23 20 14 7 12.0 10.6

100.0 82.7 57.5 37.0 17.8 8.1 0.0 0.0 29.0 37.9

0.0 4.9 14.0 20.9 24.3 19.0 0.0 0.0 11.9 10.4

1 2 8 12 36 52 72 96 39.7 34.9

sum true (bp)

sum false (bp)

overall prec.

description

804007

0

100.0

root+superkingdom

468720

17552

96.4

phylum+class+order

69661

66584

51.1

family+genus+species

91.3 92.2

all but unassigned all with unassigned

taxator-tk binning for FAMeS SimHC 600,000

80

class

real bins

taxonomic rank

90

phylum

stdev

600,000

taxator-tk binning for FAMeS SimHC

unassigned superkingdom

macro recall

80

0

100

0

pred. bins

90

0

Supplementary Figure S20 - Taxator-tk binning for FAMeS SimHC

unassigned superkingdom phylum class order family genus species avg/sum avg/sum

stdev

100

taxonomic rank

macro precision α=0.99 100.0 100.0 100.0 100.0 88.6 86.9 65.5 0.0 77.3 80.1

macro unknown precision (bp) α=0.99 0 100.0 0 100.0 0 99.4 2139 99.5 0 99.9 0 99.9 2676 100.0 0 100.0 4815 99.8 4815 99.8

taxator-tk binning for FAMeS SimHC

taxator-tk binning for FAMeS SimHC 100

0

false (bp)

(b) all reference scenario

assigned sequences in bp

unassigned superkingdom phylum class order family genus species avg/sum avg/sum

depth

Supplementary Figure S20 - Taxator-tk binning for FAMeS SimHC

(a) summary scenario

[Plot: assigned sequences in bp (false, true) and % macro-precision (α=0.99) and macro-recall, by taxonomic rank (unassigned to species).]

Supplementary Figure S20 - Taxator-tk binning for FAMeS SimHC

0 1 2 3 4 5 6 7 2.0 1.5

unassigned superkingdom phylum class order family genus species avg/sum avg/sum

true (bp) 243406 383364 215052 139146 46423 0 0 0 783985 1027391

false (bp)

1898 11207 8393 19955 7389 7499 56341 56341

macro unknown precision (bp) α=0.99 0 100.0 0 99.6 0 98.5 0 92.4 0 57.8 0 0.0 0 0.0 0 0.0 0 49.8 0 56.0

stdev

pred. bins

macro recall

stdev

real bins

0.0 0.0 1.9 11.7 44.5 0.0 0.0 0.0 8.3 7.3

1 1 7 10 12 11 6 3 7.1 6.4

100.0 55.2 28.4 20.6 3.6 0.0 0.0 0.0 15.4 26.0

0.0 21.8 17.2 14.9 9.5 0.0 0.0 0.0 9.1 7.9

1 2 8 12 36 52 72 96 39.7 34.9

sum true (bp)

sum false (bp)

1010134

0

400621

21498

0

34843

overall prec. 100.0

description root+superkingdom

94.9

phylum+class+order

0.0

family+genus+species

93.3 94.8

all but unassigned all with unassigned

rank

depth 0 1 2 3 4 5 6 7 1.7 1.4

unassigned superkingdom phylum class order family genus species avg/sum avg/sum

true (bp) 197736 480003 238652 76732 0 0 0 0 795387 993123


3399 11575 38737 16635 11103 9160 90609 90609

rank

depth 0 1 2 3 4 5 6 7 1.4 0.9

true (bp) 366106 538424 102871 0 0 0 0 0 641295 1007401

false (bp)

unknown (bp)

9200 39497 11532 7485 3564 5053 76331 76331

0 0 0 0 0 0 0 0 0 0

(g) new class scenario stdev

pred. bins

macro recall

stdev

real bins

0.0 0.0 42.8 0.0 0.0 0.0 0.0 0.0 6.1 5.3

1 1 4 9 8 6 2 2 4.6 4.1

100.0 38.5 3.8 0.0 0.0 0.0 0.0 0.0 6.0 17.8

0.0 27.4 7.3 0.0 0.0 0.0 0.0 0.0 5.0 4.3

1 2 8 12 36 52 72 96 39.7 34.9

sum true (bp)

sum false (bp)

1442954

0

102871

60229

0

16102

overall prec. 100.0

description root+superkingdom

63.1

phylum+class+order

0.0

family+genus+species

89.4 93.0

all but unassigned all with unassigned

1 1 5 7 13 9 6 3 6.3 5.6

100.0 56.7 21.9 5.3 0.0 0.0 0.0 0.0 12.0 23.0

0.0 23.4 17.7 5.9 0.0 0.0 0.0 0.0 6.7 5.9

1 2 8 12 36 52 72 96 39.7 34.9

sum false (bp)

overall prec.

description

1157742

0

100.0

root+superkingdom

315384

53711

85.4

phylum+class+order

0

36898

0.0

family+genus+species

89.8 91.6

all but unassigned all with unassigned


Supplementary Figure S20 - Taxator-tk binning for FAMeS SimHC rank

depth 0 1 2 3 4 5 6 7 1.2 0.7

unassigned superkingdom phylum class order family genus species avg/sum avg/sum


0.0 0.0 12.7 25.1 0.0 0.0 0.0 0.0 5.4 4.7

sum true (bp)

true (bp) 427371 562690 0 0 0 0 0 0 562690 990061

false (bp)

unknown (bp)

27047 16168 21525 8433 0 13083 86256 86256

0 0 0 3160 0 0 0 2676 5836 5836

macro precision α=0.99 100.0 99.1 0.0 0.0 0.0 0.0 0.0 0.0 14.2 24.9

(h) new phylum scenario stdev

pred. bins

macro recall

stdev

real bins

0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

1 1 11 11 10 7 3 5 6.9 6.1

100.0 34.1 0.0 0.0 0.0 0.0 0.0 0.0 4.9 16.8

0.0 23.0 0.0 0.0 0.0 0.0 0.0 0.0 3.3 2.9

1 2 8 12 36 52 72 96 39.7 34.9

sum true (bp)

sum false (bp)

overall prec.

description

1552751

0

100.0

root+superkingdom

0

64740

0.0

phylum+class+order

0

21516

0.0

family+genus+species

86.7 92.0

all but unassigned all with unassigned

taxator-tk binning for FAMeS SimHC 600,000

80

class

real bins

taxonomic rank

90

phylum

stdev

600,000

taxator-tk binning for FAMeS SimHC

unassigned superkingdom

macro recall

80

0

100

0

pred. bins

90

0

Supplementary Figure S20 - Taxator-tk binning for FAMeS SimHC

unassigned superkingdom phylum class order family genus species avg/sum avg/sum

stdev

100

taxonomic rank

macro precision α=0.99 100.0 99.5 42.6 0.0 0.0 0.0 0.0 0.0 20.3 30.3

macro unknown precision (bp) α=0.99 0 100.0 0 99.6 0 91.7 0 70.7 0 0.0 0 0.0 0 0.0 0 0.0 0 37.4 0 45.2

taxator-tk binning for FAMeS SimHC

100

assigned sequences in bp

% macro-precision and macro-recall

taxator-tk binning for FAMeS SimHC

false (bp)

(f) new order scenario

assigned sequences in bp

depth

Supplementary Figure S20 - Taxator-tk binning for FAMeS SimHC

(e) new family scenario

[Plot: assigned sequences in bp (false, true) and % macro-precision (α=0.99) and macro-recall, by taxonomic rank (unassigned to species).]

Supplementary Figure S21 - Binning for FAMeS SimMC scenario (Nature Methods 2011) true (kb)

false (kb)

0 1 2 3 4 5 6 7 2.1 2.1

199.2 4663.42 7936.17 2215.89 1729.56 191.38 19 1.73 16757.15 16956.35

0 2.03 2.11 25.89 28.97 13.42 11.53 0 83.95 83.95

macro unknown (kb) precision α=0.95 0 100.0 0 100.0 0 100.0 0 69.5 0 68.1 0 60.1 0 50.0 0 100.0 0 78.2 0 81.0

stdev

pred. bins

macro recall

stdev

real bins

0.0 0.0 0.0 42.7 45.1 47.2 50.0 0.0 26.4 23.1

1 1 1 3 6 14 10 1 5.1 4.6

100.0 49.4 29.9 18.9 20.1 17.1 9.1 2.1 20.9 30.8

0.0 49.4 36.1 28.7 32.0 33.0 25.3 14.4 31.3 27.4

1 2 8 12 22 29 37 47 22.4 19.8

sum true (kb)

sum false (kb)

rank

overall prec.

description

9526.04

2.03

100.0

root+superkingdom

11881.62

56.97

99.5

phylum+class+order

212.11

24.95

89.5

family+genus+species

99.5 99.5

all but unassigned all with unassigned

unassigned superkingdom phylum class order family genus species avg/sum avg/sum

false (kb)

0 1 2 3 4 5 6 7 2.5 2.2

1461.8 3788.36 4989.57 3035.46 3150.39 347.7 17.35 12.23 15341.06 16802.86

0 0 0 58.8 34.78 56.72 31.58 55.55 237.43 237.43


stdev

real bins

100.0 95.7 59.9 27.7 22.5 19.2 10.9 6.4 34.6 42.8

0.0 4.3 33.3 30.1 32.4 35.2 28.9 24.4 26.9 23.6

1 2 8 12 22 29 37 47 22.4 19.8

sum true (kb)

sum false (kb)

overall prec.

description

9038.52

0

100.0

root+superkingdom

11175.42

93.58

99.2

phylum+class+order

377.28

143.85

72.4

family+genus+species

98.5 98.6

all but unassigned all with unassigned


depth

true (kb)

false (kb)

unknown (kb)

0 1 2 3 4 5 6 7 3.5 3.1

1014.03 1534.22 1896.2 1021.99 1725.65 935.47 18.97 457.46 7589.96 8603.99

0 0 1.83 108.21 118.82 266.12 1684.92 6256.4 8436.3 8436.3

0 0 0 0 0 0 0 0 0 0

macro precision α=0.99 100.0 100.0 100.0 65.6 29.6 13.6 9.2 6.1 46.3 53.0


Supplementary Figure S21 - Binning for FAMeS SimMC scenario (Nature Methods 2011)

(c) MEGAN4 (nucleotide)

stdev

pred. bins

macro recall

stdev

real bins

0.0 0.0 0.0 45.6 40.6 32.0 25.9 21.2 23.6 20.7

1 1 1 3 11 17 36 38 15.3 13.5

100.0 47.0 34.5 29.4 30.1 15.3 3.8 0.2 22.9 32.5

0.0 47.0 38.5 31.7 35.2 30.8 14.5 1.0 28.4 24.8

1 2 8 12 22 29 37 47 22.4 19.8

sum true (kb)

sum false (kb)

overall prec.

description

4082.47

0

100.0

root+superkingdom

4643.84

228.86

95.3

phylum+class+order

1411.9

8207.44

14.7

family+genus+species

47.4 50.5

all but unassigned all with unassigned

Supplementary Figure S21 - Binning for FAMeS SimMC scenario (Nature Methods 2011) rank unassigned superkingdom phylum class order family genus species avg/sum avg/sum

MEGAN4 binning for FAMeS SimMC scenario (Nature Methods 2011) using nucleotide-level alignment

depth

true (kb)

false (kb)

unknown (kb)

0 1 2 3 4 5 6 7 3.7 3.3

1198.19 749.27 1920.91 1049.16 1771.8 961.81 2.33 457.46 6912.74 8110.93

0 0 19.56 158.44 166.53 367.01 1871.14 6346.69 8929.37 8929.37

0 0 0 0 0 0 0 0 0 0

macro precision α=0.99 100.0 100.0 50.6 49.0 23.1 13.5 7.5 7.4 35.9 43.9

(d) MEGAN5 (nucleotide)

stdev

pred. bins

macro recall

stdev

real bins

0.0 0.0 49.3 47.9 38.4 32.0 22.8 23.2 30.5 26.7

1 1 2 4 10 17 32 31 13.9 12.3

100.0 46.5 35.1 27.6 25.3 13.2 2.0 0.2 21.4 31.2

0.0 46.5 39.8 33.1 34.2 29.5 10.2 1.0 27.7 24.3

1 2 8 12 22 29 37 47 22.4 19.8

sum true (kb)

sum false (kb)

overall prec.

description

2696.73

0

100.0

root+superkingdom

4741.87

344.53

93.2

phylum+class+order

1421.6

8584.84

14.2

family+genus+species

43.6 47.6

all but unassigned all with unassigned

MEGAN5 binning for FAMeS SimMC scenario (Nature Methods 2011) using nucleotide-level alignment

[Plot: assigned sequences in kb (false, true) and precision and recall in % (macro precision α=0.99, macro recall), by taxonomic rank (unassigned to species).]

macro recall

1 1 1 3 7 18 18 13 8.7 7.8

9,000

taxonomic rank

0

pred. bins

0.0 0.0 0.0 46.5 44.7 49.0 45.7 44.0 32.8 28.7

80

assigned sequences in kb

7,000

70

unassigned superkingdom phylum class order family genus species avg/sum avg/sum

stdev

90

8,000

80

rank

macro unknown (kb) precision α=0.99 0 100.0 0 100.0 0 100.0 0 65.8 0 59.1 0 54.6 0 32.1 0 29.1 0 63.0 0 67.6

100

100

precision and recall in %

true (kb)

taxator-tk binning for FAMeS SimMC scenario (Nature Methods 2011) using protein-level alignment

taxator-tk binning for FAMeS SimMC scenario (Nature Methods 2011) using nucleotide-level alignment

0

depth

(b) taxator-tk (amino acid)

assigned sequences in kb

unassigned superkingdom phylum class order family genus species avg/sum avg/sum

depth

Supplementary Figure S21 - Binning for FAMeS SimMC scenario (Nature Methods 2011)

(a) taxator-tk (nucleotide)

[Plot: assigned sequences in kb (false, true) and precision and recall in % (macro precision α=0.99, macro recall), by taxonomic rank (unassigned to species).]

Supplementary Figure S21 - Binning for FAMeS SimMC scenario (Nature Methods 2011) true (kb)

false (kb)

0 1 2 3 4 5 6 7 3.3 3.2

232.06 1647.09 3471.29 2273.77 2950.28 1710.01 18.97 385.56 12456.97 12689.03

0 0 14 182.52 227.02 284.1 1024.71 2618.9 4351.25 4351.25

macro unknown (kb) precision α=0.99 0 100.0 0 100.0 0 51.0 0 49.1 0 31.1 0 15.6 0 12.3 0 11.1 0 38.6 0 46.3

stdev

pred. bins

macro recall

stdev

real bins

0.0 0.0 49.0 47.6 37.3 32.3 29.1 30.2 32.2 28.2

1 1 2 4 11 18 29 25 12.9 11.4

100.0 49.3 38.7 32.8 25.4 7.6 3.8 0.2 22.6 32.2

0.0 49.3 43.2 35.3 35.4 20.1 14.5 0.8 28.4 24.8

1 2 8 12 22 29 37 47 22.4 19.8

sum true (kb)

sum false (kb)

overall prec.

description

3526.24

0

100.0

root+superkingdom

8695.34

423.54

95.4

phylum+class+order

2114.54

3927.71

35.0

family+genus+species

74.1 74.5

all but unassigned all with unassigned

rank unassigned superkingdom phylum class order family genus species avg/sum avg/sum

depth

true (kb)

false (kb)

0 1 2 3 4 5 6 7 3.9 3.5

1788.99 168.48 1071.86 2853.36 5824.61 1266.67 364.14 89.11 11638.23 13427.22

0 0 49.59 446.35 769.52 1107.7 796.45 443.46 3613.07 3613.07

MEGAN5 binning for FAMeS SimMC scenario (Nature Methods 2011) using protein-level alignment

real bins

0.0 44.7 36.4 34.0 37.0 41.8 32.1 14.4 34.4 30.1

1 2 8 12 22 29 37 47 22.4 19.8

sum true (kb)

sum false (kb)

overall prec.

description

2125.95

0

100.0

root+superkingdom

9749.83

1265.46

88.5

phylum+class+order

1719.92

2347.61

42.3

family+genus+species

76.3 78.8

all but unassigned all with unassigned


depth

true (kb)

false (kb)

unknown (kb)

0 1 2 3 4 5 6 7 3.1 2.9

843.44 1483.22 3695.28 3671.01 5540.57 345.44 237.82 170.45 15143.79 15987.23

0 0 28.78 275.38 234.56 315.09 174.47 24.78 1053.06 1053.06

0 0 0 0 0 0 0 0 0 0

macro precision α=0.99 100.0 100.0 100.0 65.5 48.0 28.6 21.7 19.0 54.7 60.3


Supplementary Figure S21 - Binning for FAMeS SimMC scenario (Nature Methods 2011)

(g) CARMA (amino acid)

stdev

pred. bins

macro recall

stdev

real bins

0.0 0.0 0.0 44.9 43.3 41.3 39.9 38.2 29.7 26.0

1 1 1 3 10 32 36 25 15.4 13.6

100.0 47.5 46.0 35.4 32.1 26.3 12.4 4.3 29.2 38.0

0.0 47.5 38.1 32.6 33.8 39.1 29.7 20.2 34.4 30.1

1 2 8 12 22 29 37 47 22.4 19.8

sum true (kb)

sum false (kb)

overall prec.

description

3809.88

0

100.0

root+superkingdom

12906.86

538.72

96.0

phylum+class+order

753.71

514.34

59.4

family+genus+species

93.5 93.8

all but unassigned all with unassigned

Supplementary Figure S21 - Binning for FAMeS SimMC scenario (Nature Methods 2011) rank

depth 0 1 2 3 4 5 6 7 5.0 5.0

unassigned superkingdom phylum class order family genus species avg/sum avg/sum

CARMA binning for FAMeS SimMC scenario (Nature Methods 2011) using protein-level alignment

true (kb)

false (kb)

unknown (kb)

0 832.36 517.19 1297.42 659.69 5715.02 7116.9

0 0 25.6 52.21 76.38 272.58 474.94

0 0 0 0 0 0 0

macro precision α=0.99 100.0 100.0 100.0 67.6 49.6 49.3 49.1

16138.58 16138.58

901.71 901.71

0 0

69.2 73.6

(h) PhyloPythiaS

stdev

pred. bins

macro recall

stdev

real bins

0.0 0.0 0.0 45.5 44.9 48.2 46.0

1 1 1 3 6 6 6

100.0 49.9 50.8 54.5 37.4 33.8 23.3

0.0 49.9 41.1 40.0 34.2 40.7 37.4

1 2 8 12 22 29 37

30.8 26.4

3.8 3.4

41.6 49.9

40.6 34.8

18.3 15.9

sum true (kb)

sum false (kb)

overall prec.

description

1664.72

0

100.0

root+superkingdom

2474.3

154.19

94.1

phylum+class+order

12831.92

747.52

94.5

family+genus+species

94.7 94.7

all but unassigned all with unassigned

PhyloPythiaS binning for FAMeS SimMC scenario (Nature Methods 2011)

[Plot: assigned sequences in kb (false, true) and precision and recall in % (macro precision α=0.99, macro recall), by taxonomic rank (unassigned to species).]

stdev

100.0 44.7 45.9 37.3 39.7 30.3 13.5 2.2 30.5 39.2

80

assigned sequences in kb

precision and recall in %

7,000

70

0

macro recall

1 1 3 5 19 50 93 135 43.7 38.4

90

8,000

80

unassigned superkingdom phylum class order family genus species avg/sum avg/sum

pred. bins

0.0 0.0 40.4 46.0 35.1 28.0 23.0 15.1 26.8 23.4

100 9,000

90

rank

stdev

CARMA binning for FAMeS SimMC scenario (Nature Methods 2011) using nucleotide-level alignment

100

0

macro unknown (kb) precision α=0.99 0 100.0 0 100.0 0 48.1 0 38.2 0 23.6 0 13.7 0 6.8 0 2.5 0 33.3 0 41.6

(f) CARMA (nucleotide)

assigned sequences in kb

unassigned superkingdom phylum class order family genus species avg/sum avg/sum

depth

Supplementary Figure S21 - Binning for FAMeS SimMC scenario (Nature Methods 2011)

(e) MEGAN5 (amino acid)

[Plot: assigned sequences in kb (false, true) and precision and recall in % (macro precision α=0.99, macro recall), by taxonomic rank (unassigned to species).]

Supplementary Figure S21 - Binning for FAMeS SimMC scenario (Nature Methods 2011)

(i) Kraken

rank                    unassigned  superkingdom  phylum  class  order   family  genus   species  avg/sum  avg/sum
depth                   0           1             2       3      4       5       6       7        4.4      0.9
true (kb)               5974.51     78.01         208.37  144    239.27  78.26   1.01    527.7    1276.62  7251.13
false (kb)              0           0             0       17.5   61.94   105.58  578.56  9025.58  9789.16  9789.16
unknown (kb)            0           0             0       0      0       0       0       0        0        0
macro precision α=0.99  100.0       100.0         45.2    36.6   13.2    8.2     4.3     2.1      29.9     38.7
stdev                   0.0         0.0           41.4    44.9   28.5    25.2    17.1    13.1     24.3     21.3
pred. bins              1           1             3       5      21      43      83      123      39.9     35.0
macro recall            100.0       32.5          32.6    28.2   36.7    32.6    20.6    11.5     27.8     36.8
stdev                   0.0         32.5          36.0    33.5   38.6    42.5    37.9    30.9     36.0     31.5
real bins               1           2             8       12     22      29      37      47       22.4     19.8

description           sum true (kb)  sum false (kb)  overall prec.
root+superkingdom     6130.53        0               100.0
phylum+class+order    591.64         79.44           88.2
family+genus+species  606.97         9709.72         5.9
all but unassigned                                   11.5
all with unassigned                                  42.6

Kraken binning for FAMeS SimMC scenario (Nature Methods 2011) 100 9,000

90

precision and recall in %

7,000

70

6,000

60 50

5,000

40

4,000

30

3,000

20

2,000

10

1,000

0

unassigned superkingdom

phylum

class

order

family

genus

species

assigned sequences in kb

8,000

80

false (kb) true (kb) macro precision α=0.99 macro recall

0

taxonomic rank

taxator-tk Supplementary Material Page 46 of 51
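The "overall prec." values listed for Kraken on the FAMeS SimMC scenario appear to be consistent with the ratio of correctly assigned kb to all assigned kb. The following minimal sketch reproduces them from the kb sums given in the table; the function name `precision_percent` is hypothetical, and the formula is inferred from the numbers rather than stated in the text.

```python
def precision_percent(true_kb: float, false_kb: float) -> float:
    """Percentage of assigned sequence (in kb) that was assigned correctly.

    Assumed formula: 100 * true / (true + false); inferred from the table values.
    """
    assigned = true_kb + false_kb
    if assigned == 0:
        raise ValueError("no assigned sequence")
    return 100.0 * true_kb / assigned


# (sum true kb, sum false kb) taken from the Kraken rows of Supplementary Figure S21
groups = {
    "phylum+class+order": (591.64, 79.44),
    "family+genus+species": (606.97, 9709.72),
    "all but unassigned": (1276.62, 9789.16),
    "all with unassigned": (7251.13, 9789.16),
}

for name, (true_kb, false_kb) in groups.items():
    print(f"{name}: {precision_percent(true_kb, false_kb):.1f}")
```

The printed values can be compared with the 88.2, 5.9, 11.5 and 42.6 reported in the summary rows.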

Supplementary Figure S22 - Binning for partitioned cow rumen sample

rank                      unassigned    superkingdom  phylum        class         order         family        genus         species       avg/sum       avg/sum
depth                     0             1             2             3             4             5             6             7             1.8           1.0
consistent (kb)           144478        102968        25730         2256          13988         2400          5552          0             152894        297372
inconsistent (kb)         0             154           3872          1350          1810          964           1090          10670         19910         19910
unknown (kb)              0             0             22            54            42            104           132           890           1244          1244
macro consistency α=0.99  100.0         99.9          66.6          61.1          55.4          52.1          52.6          0.0           55.4          61.0
stdev                     0.0           0.0           20.9          18.7          17.1          23.2          36.6          0.0           16.6          14.5
pred. bins                1             1             13            28            62            167           572           1254          299.6         262.3
macro recall              100.0         42.6          13.0          11.1          9.7           9.0           9.2           0.0           13.5          24.3
stdev                     0.0           13.9          5.8           4.7           4.5           4.8           5.1           0.0           5.5           4.8
cons. bins                1             2             30            52            99            198           446           926           250.4         219.3

sum true (kb)   sum false (kb)   overall consist.   description
350414          154              100.0              root+superkingdom
41974           7032             85.7               phylum+class+order
7952            12724            38.5               family+genus+species
                                 88.5               all but unassigned
                                 93.7               all with unassigned


inconsistent (kb)

unknown (kb)

0 1 2 3 4 5 6 7 3.0 0.9

192034 23020 17210 2620 19082 2190 7412 16336 87870 279904

0 66 2812 1202 3646 2166 4932 21246 36070 36070

0 0 12 6 14 82 216 2222 2552 2552

rank

depth

consistent (kb)

inconsistent (kb)

unknown (kb)

0 1 2 3 4 5 6 7 2.7 1.8

87760 65560 35352 2242 42082 2802 12764 31322 192124 279884

0 116 2802 1090 3308 3220 6726 18358 35620 35620

0 0 34 42 66 178 436 2266 3022 3022

1 2 30 52 104 221 578 1330 331.0 289.8

sum true (kb)

sum false overall (kb) consist.

description

238074

66

100.0

root+superkingdom

38912

7660

83.6

phylum+class+order

25938

28344

47.8

family+genus+species

70.9 88.6

all but unassigned all with unassigned

(c) MEGAN4 (nucleotide)
[Plot: assigned sequences in kb (unknown, inconsistent, consistent) and % macro-consistency (α=0.99) by taxonomic rank]

pred. bins

macro recall

stdev

cons. bins

100.0 43.8 24.7 19.7 15.5 13.6 12.6 11.7 20.2 30.2

0.0 26.2 16.9 13.9 11.9 9.2 8.0 7.6 13.4 11.7

1 3 27 48 88 168 295 535 166.3 145.6

sum true (kb)

sum false overall (kb) consist.

description

218880

116

99.9

root+superkingdom

79676

7200

91.7

phylum+class+order

46888

28304

62.4

family+genus+species

84.4 88.7

all but unassigned all with unassigned

Supplementary Figure S22 - Binning for partitioned cow rumen sample

(d) MEGAN5 (amino acid): MEGAN binning for partitioned cow rumen sample using protein-level alignment
[Plot: assigned sequences in kb (unknown, inconsistent, consistent) and % macro-consistency (α=0.99) by taxonomic rank]

rank                      unassigned    superkingdom  phylum        class         order         family        genus         species       avg/sum       avg/sum
depth                     0             1             2             3             4             5             6             7             2.5           2.2
consistent (kb)           36062         90540         59686         6054          46384         4208          13258         24518         244648        280710
inconsistent (kb)         0             220           4074          1614          4808          4130          6756          12322         33924         33924
unknown (kb)              0             0             26            24            118           358           778           2588          3892          3892
macro consistency α=0.99  100.0         99.9          67.2          51.0          36.3          35.1          36.7          46.2          53.2          59.1
stdev                     0.0           0.0           25.6          27.9          27.1          21.7          20.3          20.5          20.4          17.9
pred. bins                1             1             12            25            52            119           203           347           108.4         95.0
macro recall              100.0         72.9          25.3          20.1          15.9          13.1          12.3          11.5          24.4          33.9
stdev                     0.0           15.9          17.8          13.4          11.4          8.7           7.9           8.6           11.9          10.5
cons. bins                1             2             26            44            79            140           218           343           121.7         106.6

sum true (kb)   sum false (kb)   overall consist.   description
217142          220              99.9               root+superkingdom
112124          10496            91.4               phylum+class+order
41984           23208            64.4               family+genus+species
                                 87.8               all but unassigned
                                 89.2               all with unassigned

cons. bins

0.0 7.1 6.0 4.5 4.3 4.1 4.7 5.0 5.1 4.5


MEGAN binning for partitioned cow rumen sample using nucleotide-level alignment


stdev


0.0 1 0.0 1 24.6 12 27.7 25 26.3 51 21.1 132 20.4 264 21.9 564 20.3 149.9 17.8 131.3

stdev

macro recall 100.0 33.2 15.4 13.9 12.3 11.0 11.0 10.4 15.3 25.9


Supplementary Figure S22 - Binning for partitioned cow rumen sample

unassigned superkingdom phylum class order family genus species avg/sum avg/sum

pred. bins

0.0 1 0.0 1 27.8 15 21.7 31 17.3 71 14.8 188 19.3 611 29.0 1956 18.6 410.4 16.2 359.3

stdev


macro consistency α=0.99 100.0 99.9 67.4 51.9 39.6 34.9 33.6 38.9 52.3 58.3

macro consistency α=0.99 100.0 99.9 47.1 39.6 31.9 31.6 31.1 31.0 44.6 51.5


consistent (kb)

CARMA binning for partitioned cow rumen sample using protein-level alignment

CARMA binning for partitioned cow rumen sample using nucleotide-level alignment


depth

(b) CARMA (amino acid)

assigned sequences in kb

unassigned superkingdom phylum class order family genus species avg/sum avg/sum

depth

Supplementary Figure S22 - Binning for partitioned cow rumen sample

(a) CARMA (nucleotide)
[Plot: assigned sequences in kb (unknown, inconsistent, consistent) and % macro-consistency (α=0.99) by taxonomic rank]

Supplementary Figure S22 - Binning for partitioned cow rumen sample

rank                      unassigned    superkingdom  phylum        class         order         family        genus         species       avg/sum       avg/sum
depth                     0             1             2             3             4             5             6             7             1.9           1.2
consistent (kb)           122146        115428        39828         1676          23582         2524          8938          1348          193324        315470
inconsistent (kb)         0             62            1152          334           726           198           198           44            2714          2714
unknown (kb)              0             0             4             28            28            100           94            88            342           342
macro consistency α=0.99  100.0         100.0         87.7          80.2          78.3          79.3          76.2          78.0          82.8          85.0
stdev                     0.0           0.0           16.9          17.0          20.2          19.8          35.9          37.4          21.0          18.4
pred. bins                1             1             7             14            16            50            110           123           45.9          40.3
macro recall              100.0         56.8          16.4          13.7          11.7          10.3          9.8           8.6           18.2          28.4
stdev                     0.0           7.3           12.3          11.3          10.8          8.8           7.8           6.7           9.3           8.1
cons. bins                1             2             22            34            56            84            94            103           56.4          49.5

sum true (kb)   sum false (kb)   overall consist.   description
353002          62               100.0              root+superkingdom
65086           2212             96.7               phylum+class+order
12810           440              96.7               family+genus+species
                                 98.6               all but unassigned
                                 99.1               all with unassigned

rank unassigned superkingdom phylum class order family genus species avg/sum avg/sum

depth

consistent (kb)

inconsistent (kb)

unknown (kb)

0 1 2 3 4 5 6 7 1.8 1.6

36398 162352 63942 5920 34474 2586 8344 1136 278754 315152

0 144 1604 410 600 92 228 8 3086 3086

0 0 2 24 30 72 82 78 288 288

taxator-tk binning for partitioned cow rumen sample using nucleotide-level alignment


cons. bins

0.0 14.5 15.7 13.1 12.4 9.4 8.1 7.8 11.6 10.1

1 2 17 26 37 51 55 49 33.9 29.8

sum true (kb)

sum false overall (kb) consist.

description

361102

144

100.0

root+superkingdom

104336

2614

97.6

phylum+class+order

12066

328

97.4

family+genus+species

98.9 99.0

all but unassigned all with unassigned

Supplementary Figure S22 - Binning for partitioned cow rumen sample

(g) PhyloPythiaS
[Plot: assigned sequences in kb (unknown, inconsistent, consistent) and % macro-consistency (α=0.99) by taxonomic rank]

rank                      unassigned    superkingdom  phylum        class         order         family        genus         species       avg/sum       avg/sum
depth                     0             1             2             3             4             5             6             7             2.0           2.0
consistent (kb)           0             148138        65136         9468          31734         3990          6144          3708          268318        268318
inconsistent (kb)         0             2810          24220         9930          6490          1438          1078          538           46504         46504
unknown (kb)              0             0             4             568           828           708           1072          524           3704          3704
macro consistency α=0.99  100.0         100.0         66.6          55.3          56.2          58.7          62.8          64.7          66.3          70.6
stdev                     0.0           0.0           15.1          22.0          21.0          18.9          18.5          29.0          17.8          15.6
pred. bins                1             1             4             10            19            30            33            64            23.0          20.3
macro recall              100.0         81.6          31.0          21.0          12.8          10.4          9.2           8.0           24.9          34.3
stdev                     0.0           17.5          14.7          10.7          4.9           4.1           4.2           4.6           8.7           7.6
cons. bins                1             2             7             13            25            39            45            67            28.3          24.9

sum true (kb)   sum false (kb)   overall consist.   description
296276          2810             99.1               root+superkingdom
106338          40640            72.3               phylum+class+order
13842           3054             81.9               family+genus+species
                                 85.2               all but unassigned
                                 85.2               all with unassigned

Supplementary Figure S22 - Binning for partitioned cow rumen sample

(h) Kraken: Kraken binning for partitioned cow rumen sample

rank                      unassigned    superkingdom  phylum        class         order         family        genus         species       avg/sum       avg/sum
depth                     0             1             2             3             4             5             6             7             3.9           0.2
consistent (kb)           265616        2262          728           190           1116          254           1378          17642         23570         289186
inconsistent (kb)         0             4             504           434           690           684           2736          23798         28850         28850
unknown (kb)              0             0             0             2             8             8             34            438           490           490
macro consistency α=0.99  100.0         99.9          42.6          41.3          34.5          33.5          32.4          33.3          45.4          52.2
stdev                     0.0           0.0           25.6          22.9          18.4          15.3          19.5          28.2          18.6          16.2
pred. bins                1             1             18            33            77            195           661           1953          419.7         367.4
macro recall              100.0         21.0          10.8          10.5          9.9           9.3           9.7           9.9           11.6          22.6
stdev                     0.0           1.2           5.8           5.3           5.0           4.3           4.6           5.2           4.5           3.9
cons. bins                1             2             30            52            110           233           640           1461          361.1         316.1

sum true (kb)   sum false (kb)   overall consist.   description
270140          4                100.0              root+superkingdom
2034            1628             55.5               phylum+class+order
19274           27218            41.5               family+genus+species
                                 45.0               all but unassigned
                                 90.9               all with unassigned

PhyloPythiaS binning for partitioned cow rumen sample
[Plot: assigned sequences in kb (unknown, inconsistent, consistent) and % macro-consistency (α=0.99) by taxonomic rank]

(f) taxator-tk (amino acid): taxator-tk binning for partitioned cow rumen sample using protein-level alignment
[Plot: assigned sequences in kb (unknown, inconsistent, consistent) and % macro-consistency (α=0.99) by taxonomic rank]

rank                      unassigned    superkingdom  phylum        class         order         family        genus         species       avg/sum       avg/sum
macro consistency α=0.99  100.0         100.0         94.7          88.6          89.3          78.6          81.2          78.7          87.3          88.9
stdev                     0.0           0.0           6.2           14.1          15.4          22.0          26.8          37.7          17.5          15.3
pred. bins                1             1             5             10            9             32            27            59            20.4          18.0
macro recall              100.0         74.1          20.9          16.3          14.0          11.8          10.5          9.8           22.5          32.2

Supplementary Figure S22 - Binning for partitioned cow rumen sample

(e) taxator-tk (nucleotide)
[Plot: assigned sequences in kb (unknown, inconsistent, consistent) and % macro-consistency (α=0.99) by taxonomic rank]

Supplementary Figure S23: Parallel speedup of program taxator

[Plot: taxator assignment speedup (axis 0 to 14) against number of CPU cores (1, 2, 5, 10, 15, 20)]

Execution time analysis with taxator for parallelized processing with multiple CPU cores. Taxonomic placement of sequence segments with taxator on input alignments for sequences of length 1000 bp (syn1000 data set aligned against mRefSeq47 with LAST). The speedup was calculated from the wall-clock time of a parallelized run relative to serial execution with one CPU thread. With multiple threads, there is always one producer thread (producer-consumer model), so with more than two threads multiple consumers work on the input data in parallel. An approximately linear scale-up was observed up to 15 threads, and saturation effects appear when using 20 CPU cores on our system.

$S_p = T_1 / T_p$, with $T_1$: serial execution time; $T_p$: execution time using $p$ threads and CPU cores
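As a minimal sketch of the speedup calculation described for Supplementary Figure S23, the following computes speedup as serial wall-clock time divided by parallel wall-clock time, together with the parallel efficiency (speedup per core). The function names and the timings are hypothetical placeholders for illustration, not measurements from the benchmark.

```python
def speedup(t_serial: float, t_parallel: float) -> float:
    """Speedup: serial wall-clock time divided by parallel wall-clock time."""
    if t_serial <= 0 or t_parallel <= 0:
        raise ValueError("run times must be positive")
    return t_serial / t_parallel


def efficiency(t_serial: float, t_parallel: float, cores: int) -> float:
    """Parallel efficiency: speedup divided by the number of cores used."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return speedup(t_serial, t_parallel) / cores


# Hypothetical wall-clock times in minutes, keyed by number of CPU cores.
timings = {1: 120.0, 5: 26.0, 15: 10.0}
t1 = timings[1]
for cores, t_p in sorted(timings.items()):
    print(f"{cores} cores: speedup {speedup(t1, t_p):.2f}, "
          f"efficiency {efficiency(t1, t_p, cores):.2f}")
```

An efficiency that drops well below 1 as cores are added would correspond to the saturation effect mentioned in the caption.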


Supplementary Figure S24: Effect of input sequence length and segmentation on taxator-tk processing time.

(a) taxator processing time with segmentation
(b) taxator processing time without segmentation
[Plots: time (min. wall clock, axis 0 to 180) against individual sequence length in bp (100, 500, 1000); series: all reference, new species, new genus, new family, new order, new class, new phylum]

We processed approximately the same number of sequences of length 100, 500 and 1000 bp with taxator-tk (syn100, syn500, syn1000), once with the segmentation procedure enabled (a) and once with segmentation disabled (b). In both cases the run-time increases approximately linearly with the input length, where the slope depends on the completeness of the reference sequence data. With all reference data available, the run-time increases more than linearly, as there is no segmentation of queries during computations. For all other cases, segmentation substantially decreases the execution time.


Supplementary Figure S25: Example GFF3 output of taxator

##gff-version 3
contig_0	taxator-tk	sequence_feature	102	121	1	.	.	seqlen=1012;tax=1224:19;ival=0.5
contig_0	taxator-tk	sequence_feature	155	194	0.91	.	.	seqlen=1012;tax=2:32-1;ival=0.8
contig_0	taxator-tk	sequence_feature	201	220	1	.	.	seqlen=1012;tax=40324:20-2;ival=0
contig_0	taxator-tk	sequence_feature	225	243	1	.	.	seqlen=1012;tax=316277:19-1;ival=1
contig_0	taxator-tk	sequence_feature	246	301	1	.	.	seqlen=1012;tax=731:38-1224;ival=0.72
contig_0	taxator-tk	sequence_feature	326	471	1	.	.	seqlen=1012;tax=338:87-1224;ival=0.98
contig_0	taxator-tk	sequence_feature	486	554	0.60	.	.	seqlen=1012;tax=1224:59-2;ival=0.67
contig_0	taxator-tk	sequence_feature	555	616	0.63	.	.	seqlen=1012;tax=32008:43-1224;ival=0.86
contig_0	taxator-tk	sequence_feature	633	651	1	.	.	seqlen=1012;tax=876:19-1;ival=1
contig_0	taxator-tk	sequence_feature	670	745	0.89	.	.	seqlen=1012;tax=31998:60-1;ival=0.89
contig_0	taxator-tk	sequence_feature	786	809	0.89	.	.	seqlen=1012;tax=256618:23-2;ival=0.2
contig_0	taxator-tk	sequence_feature	886	932	1	.	.	seqlen=1012;tax=644:33-2;ival=0.67
contig_0	taxator-tk	sequence_feature	958	980	1	.	.	seqlen=1012;tax=347:22-1;ival=1

Columns, from left to right: Query identifier, Generator, Type, Begin, End, Score, Strand, Phase, and a final column holding Query length (seqlen), Taxonomic range and support (tax) and Interpolation value (ival).

Query segment assignments calculated by the program taxator (version 1.1.1) are written in standard GFF3 format. Each tab-separated field holds the information named in the column description above. The score measures the assignment quality and is under ongoing improvement. Strand and phase contain a dot as a placeholder because these GFF3 fields do not apply to this output. The last column holds data in a key-value scheme and includes the query sequence length, a taxonomic prediction range of the form low:support-high, where low/specific (node X in Fig. 2a) and high/general (node R in Fig. 2a) are NCBI taxon IDs, and an interpolation value. The interpolation value, ranging from zero (low) to one (high), can be used to determine an approximate position in the given taxonomic range. As it might become necessary for post-processing applications such as whole sequence binning, more information can be added in the last column while preserving backward compatibility.
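The key-value scheme of the last GFF3 column can be read with a few lines of Python. This is a minimal sketch based only on the example rows of Supplementary Figure S25: the `Segment` class and `parse_line` function are hypothetical names, and treating a `tax` value without a `-high` part (as in `tax=1224:19`) as having no separate upper bound is an assumption, not documented behaviour of taxator.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Segment:
    query: str
    begin: int
    end: int
    score: float
    seqlen: int
    low_taxid: int              # low/specific end of the taxonomic range
    support: int
    high_taxid: Optional[int]   # high/general end; None if not given (assumption)
    ival: float


def parse_line(line: str) -> Segment:
    """Parse one tab-separated taxator GFF3 record (not a '##' header line)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 tab-separated fields, got {len(fields)}")
    query, _generator, _type, begin, end, score, _strand, _phase, attrs = fields
    # The last column is a ';'-separated list of key=value pairs.
    kv = dict(item.split("=", 1) for item in attrs.split(";") if item)
    # tax has the form low:support-high; the '-high' part may be absent.
    low, _, rest = kv["tax"].partition(":")
    support, _, high = rest.partition("-")
    return Segment(
        query=query,
        begin=int(begin),
        end=int(end),
        score=float(score),
        seqlen=int(kv["seqlen"]),
        low_taxid=int(low),
        support=int(support),
        high_taxid=int(high) if high else None,
        ival=float(kv["ival"]),
    )


example = "contig_0\ttaxator-tk\tsequence_feature\t246\t301\t1\t.\t.\tseqlen=1012;tax=731:38-1224;ival=0.72"
print(parse_line(example))
```

Keeping unknown keys in the intermediate dictionary rather than rejecting them matches the stated intent that further information may be added to the last column later.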
