Supplementary Material Bioinformatics revised final

Metaxa2 Database Builder: Enabling taxonomic identification from metagenomic and metabarcoding data using any genetic marker

Johan Bengtsson-Palme*, Rodney T. Richardson, Marco Meola, Christian Wurzbacher, Émilie D. Tremblay, Kaisa Thorell, Kärt Kanger, K. Martin Eriksson, Guillaume J. Bilodeau, Reed M. Johnson, Martin Hartmann, R. Henrik Nilsson

* Address correspondence to [email protected]; +46 31 342 46 26; Department of Infectious Diseases, Institute of Biomedicine, The Sahlgrenska Academy, University of Gothenburg, Guldhedsgatan 10, SE-413 46, Gothenburg, Sweden; Center for Antibiotic Resistance research (CARe) at University of Gothenburg, Box 440, SE-40530, Gothenburg, Sweden; Wisconsin Institute for Discovery, University of Wisconsin-Madison, 330 N. Orchard Street, Madison, WI 53715, USA

Supplementary Material

Supplementary Figures S1-S5
Supplementary Table S1
Supplementary Text S1
Supplementary Item S1 is available as a separate file

[Figure S1: workflow flowchart. Inputs: sequence data (FASTA format) and a taxonomy file (Metaxa2, FASTA, ASN or XML format). Branches: divergent mode (default), hybrid mode and conserved mode. Step labels shown: cluster sequences at 20% identity; identify representative sequence; align each cluster; extract full-length sequences; build two HMMs for each cluster; align sequences; extract matching input sequences; trim sequence ends; re-align sequences; find conserved regions; align conserved regions individually; build HMM for each conserved region; filter taxonomy data (optional); cross-check sequences and taxonomy data; build BLAST database; determine identity thresholds for taxonomic levels; evaluate database performance (optional). Tools shown: USEARCH or VSEARCH, MAFFT, HMMER, BLAST and Metaxa2. Outputs: final Metaxa2 database including HMMs, a BLAST database and taxonomic information, plus evaluation output.]

Supplementary Fig. S1. The Metaxa2 Database Builder workflow. All steps running USEARCH can also be performed with VSEARCH, depending on which of the software packages is detected by the database builder.

[Figure S2: six panels, A-F, one per combination of operating mode (conserved, divergent, hybrid) and measure. Y-axes: proportion of sequences, 0% to 100%. X-axes: taxonomic level (Domain, Phylum, Class, Order, Family, Genus). Barcoding regions shown: 16S rRNA, ITS2, matK, rbcL, trnH, trnL, cpn60, EF1alpha, rpb1 and rpb2.]

Supplementary Fig. S2. Proportion of successfully detected and correctly assigned sequences at different taxonomic levels for each tested DNA barcoding region for databases built in the conserved (a), divergent (b) and hybrid (c) modes. Proportion of incorrectly assigned sequences at different taxonomic levels for databases built in the conserved (d), divergent (e) and hybrid (f) modes. Proportions were calculated based on the internal self-evaluation procedure of the Metaxa2 Database Builder.

[Figure S3: y-axis: proportion of input sequences included in database, 0% to 100%. Groups: conserved mode, divergent mode and hybrid mode.]

Supplementary Fig. S3. Proportion of sequences in the input datasets that were included in the final classification databases for the different operating modes.

[Figure S4: five panels, A-E. X-axes: internal evaluation (panel E: included x correct proportion). Y-axes: input evaluation. Panel labels: proportion correct sequences; proportion incorrect sequences; proportion non-classified sequences; proportion overclassified sequences against proportion sequences excluded; proportion correct sequences. Legend: Conserved, Divergent, Hybrid.]

Supplementary Fig. S4. Comparison between the internal software evaluation (x-axes) and the corresponding evaluations of 150 nt sequence fragments (input evaluation; y-axes) for the three different operating modes. A) Correct proportions reported by the self-evaluation compared to the actual proportion of correct fragments from all input sequences, on the family level. B) Proportions of incorrect family assignments. C) Proportions of sequences not assigned at the family level. D) Proportions of sequences not included in the built databases compared to the overclassification rates for the fragments (i.e. sequence fragments belonging to a family not present in the final database, but still assigned to a family by Metaxa2). E) Proportion of correctly assigned sequences, multiplied with the proportion of sequences included in the final classification databases, compared to the actual proportion of correct fragments from all input sequences, on the family level.

Supplementary Fig. S5. Comparison of the Metaxa2 performance on the SSU rRNA sequence fragments used for the original software evaluation of Metaxa2 for the native Metaxa2 database (a), and databases constructed using the Metaxa2 Database Builder (b), based on SILVA release 111 (c) and SILVA release 128 (e) using sequence filtering, and without filtering (d and f). PE denotes paired sequence fragments, similar to paired-end reads. Sequence fragments were regarded as correctly classified if all taxonomic levels reported by Metaxa2 corresponded to the known taxonomy of the input sequence that the fragment was derived from. For ‘perfect’ entries, classifications corresponded exactly and entirely to the known taxonomic affiliation at every investigated taxonomic level. If any incorrect taxonomic assignments were made at any taxonomic level, the fragment was regarded as misclassified.
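The three outcome categories in the Fig. S5 caption can be illustrated with a short Python sketch. This is only one reading of the caption, not the evaluation code used for the figure. The function name `judge_fragment`, the representation of a lineage as a list of taxon names ordered from the highest level downwards, and the choice to treat an assignment that goes deeper than the known lineage as misclassified are all assumptions made for illustration.

```python
def judge_fragment(reported, known):
    """Label one fragment's assignment against its known lineage.

    Both arguments are lists of taxon names ordered from the highest
    taxonomic level downwards (an assumed representation).
    Returns the most specific label that applies.
    """
    # Assumption: reporting more levels than are known cannot be verified,
    # so it is treated as a misclassification here.
    if len(reported) > len(known):
        return "misclassified"
    # Any single disagreeing level makes the whole fragment misclassified.
    for got, expected in zip(reported, known):
        if got != expected:
            return "misclassified"
    # All reported levels agree. The entry is 'perfect' only when the
    # assignment also covers every level of the known lineage.
    if len(reported) == len(known):
        return "perfect"
    return "correct"


known = ["Bacteria", "Proteobacteria", "Gammaproteobacteria"]
print(judge_fragment(["Bacteria", "Proteobacteria"], known))
print(judge_fragment(known, known))
print(judge_fragment(["Bacteria", "Firmicutes"], known))
```

Under this reading, a 'perfect' entry is a special case of a correct one, so the function reports only the stricter label when both apply.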

Supplementary Table S1

Barcoding region | Taxonomic coverage | Dataset description | Data source | Ref.
16S rRNA | Bacteria | Type strains and cultivated strains in SILVA release 128 | SILVA | 32
ATP9-NAD9 | Phytophthora species, from the phylum Oomycota | Phytophthora species curated sequences including 140 different species/hybrids (a total of 123 species are currently described http://www.phytophthoradb.org) | NCBI Nucleotide | 35
cpn60 | Lactobacilli | Group I sequences present in the cpnDB (http://www.cpndb.ca) | cpnDB | 34
EF1alpha | Fungi | Fungal sequences from all major branches | NCBI Nucleotide | 31
ITS2 | Vascular plants | Plant sequences, filtered to only contain sequences from plants found in Ohio | NCBI Nucleotide | 16
matK | Vascular plants | Plant sequences, filtered to only contain sequences from plants found in Ohio | NCBI Nucleotide | 16
rbcL | Vascular plants | Plant sequences, filtered to only contain sequences from plants found in Ohio | NCBI Nucleotide | 16
rpb1 | Fungi | Fungal sequences from all major branches | NCBI Nucleotide | 31
rpb2 | Fungi | Fungal sequences from all major branches | NCBI Nucleotide | 31
trnH | Vascular plants | Plant sequences, filtered to only contain sequences from plants found in Ohio | NCBI Nucleotide | 16
trnL | Vascular plants | Plant sequences, filtered to only contain sequences from plants found in Ohio | NCBI Nucleotide | 16

Supplementary Text S1. Step-by-step usage guide for the Metaxa2 Database Builder

This is a step-by-step guide to recreate the rpb2 database, to show readers how to use the Metaxa2 Database Builder. The guide assumes that you have Metaxa2 and all its dependencies (HMMER, BLAST, MAFFT, USEARCH) installed before starting. Instructions on installing these can be found in the software manual (Supplementary Item S1).

1) Obtain high-quality sequences for the genetic marker in question. For rpb2, we used reference sequences from James et al. 2006 (see supplementary table 1; https://media.nature.com/original/nature-assets/nature/journal/v443/n7113/extref/nature05110-s1.pdf) downloaded from NCBI GenBank in FASTA format.

2) For the same sequence set, obtain high-quality taxonomic information. For rpb2, we used the NCBI GenBank records corresponding to the sequences above, downloaded in GBK format.

3) If possible, manually curate the taxonomic data to remove obviously erroneous entries (sequences without taxonomic data will be removed automatically by the database builder). For the rpb2 data, we also converted the taxonomy information to Metaxa2 native format (tab-separated, see manual) in this step, but this is not a requirement for software operation.

4) Run the Metaxa2 database builder using the following command:

metaxa2_dbb -e rpb2.fasta -t rpb2_taxonomy_filtered.txt -o rpb2_DIV -g rpb2 --auto_rep T --cpu 4 --save_raw T --filter_uncultured F --correct_taxonomy T --evaluate T --mode divergent

This command will create a Metaxa2 database for rpb2 using the divergent (default) mode. The “--auto_rep T” option tells the database builder to automatically select an appropriate reference sequence in the input data. The “--save_raw T” option can be useful if an error occurs, as it saves all the intermediate files created during database construction. Note that we instruct the program to attempt to correct and standardize the taxonomic information (“--correct_taxonomy T”) but not to remove sequences associated with uncultured species (“--filter_uncultured F”). Depending on the input data, different options could be useful here. Finally, we also use the “--evaluate T” option to self-evaluate the database. This is important, as the result will be used later in this guide to select which mode produced the best database.

5) When the command has finished, run the program again, but this time using the conserved mode (note the changed output name):

metaxa2_dbb -e rpb2.fasta -t rpb2_taxonomy_filtered.txt -o rpb2_CON -g rpb2 --auto_rep T --cpu 4 --save_raw T --filter_uncultured F --correct_taxonomy T --evaluate T --mode conserved
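The two builder runs in steps 4 and 5 differ only in the output name and the mode, so they can be generated from one template. The following Python sketch assembles the two command lines using only the options that appear in this guide. The helper name `build_dbb_command` is hypothetical, and the sketch only prints the commands rather than running them.

```python
import shlex


def build_dbb_command(mode, out_dir):
    """Return the metaxa2_dbb argument list for one operating mode.

    All options and file names are taken from the commands in this guide;
    only the output name and the mode vary between runs.
    """
    return [
        "metaxa2_dbb",
        "-e", "rpb2.fasta",
        "-t", "rpb2_taxonomy_filtered.txt",
        "-o", out_dir,
        "-g", "rpb2",
        "--auto_rep", "T",
        "--cpu", "4",
        "--save_raw", "T",
        "--filter_uncultured", "F",
        "--correct_taxonomy", "T",
        "--evaluate", "T",
        "--mode", mode,
    ]


commands = {
    "divergent": build_dbb_command("divergent", "rpb2_DIV"),
    "conserved": build_dbb_command("conserved", "rpb2_CON"),
}

for mode, argv in commands.items():
    # Print each command line for inspection. To execute one instead,
    # pass the list to subprocess.run(argv, check=True).
    print(shlex.join(argv))
```

Keeping the arguments as a list, rather than one string, avoids shell quoting problems if a file name ever contains spaces.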

6) Since we now have self-evaluation results from both the conserved and divergent modes, we can compare the performances of the two methods to select the best database. To do so, find the information in the file “evaluation_statistics.0.1.txt” (inside the database directory) and copy it e.g. to an Excel spreadsheet. Do the same for both modes.

Divergent
Taxlevel | Correct | Incorrect | Missing
1 | 1 | 0 | 0
2 | 0.87222222 | 0.01666667 | 0.11111111
3 | 0.86111111 | 0.01111111 | 0.12777778
4 | 0.6 | 0.01666667 | 0.38333333
5 | 0.53333333 | 0 | 0.46666667
6 | 0.46666667 | 0.01666667 | 0.51666667
7 | 0.31111111 | 0.02777778 | 0.66111111

Conserved
Taxlevel | Correct | Incorrect | Missing
1 | 1 | 0 | 0
2 | 0.88 | 0 | 0.12
3 | 0.88 | 0 | 0.12
4 | 0.74 | 0.02 | 0.24
5 | 0.6 | 0 | 0.4
6 | 0.52 | 0 | 0.48
7 | 0.36 | 0 | 0.64

You should now have a table looking somewhat like the one above.

7) Next, find the number of input sequences and the number actually included in the final database. This can be done using the following commands:

grep -c ">" rpb2.fasta
grep -c ">" rpb2_DIV/blast.fasta
grep -c ">" rpb2_CON/blast.fasta
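Where grep is not available, the same counts can be obtained with a few lines of Python. This is a minimal sketch, and the function names `count_headers` and `count_fasta_records` are illustrative. Unlike `grep -c ">"`, it counts only lines that begin with ">", which gives the same number for a well-formed FASTA file.

```python
def count_headers(lines):
    """Count FASTA header lines in an iterable of text lines."""
    return sum(1 for line in lines if line.startswith(">"))


def count_fasta_records(path):
    """Count the sequences in a FASTA file by counting its header lines."""
    with open(path) as handle:
        return count_headers(handle)


# Small in-memory example with two records.
example = [">seq1\n", "ACGTACGT\n", ">seq2\n", "GGCCTTAA\n"]
print(count_headers(example))
```

For the files in this guide, `count_fasta_records` would be called on rpb2.fasta, rpb2_DIV/blast.fasta and rpb2_CON/blast.fasta.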

In the rpb2 case, the numbers were 180 (input file), 180 (divergent mode database) and 59 (conserved mode database).

8) Multiply the numbers in the evaluation spreadsheet by the proportion of sequences included in each database mode (for rpb2 that would be 180/180 = 1.0 for the divergent mode and 59/180 = 0.328 for the conserved mode). This results in the following table:

Divergent
Taxlevel | Correct | Incorrect | Missing
1 | 1 | 0 | 0
2 | 0.87222222 | 0.01666667 | 0.11111111
3 | 0.86111111 | 0.01111111 | 0.12777778
4 | 0.6 | 0.01666667 | 0.38333333
5 | 0.53333333 | 0 | 0.46666667
6 | 0.46666667 | 0.01666667 | 0.51666667
7 | 0.31111111 | 0.02777778 | 0.66111111

Conserved
Taxlevel | Correct | Incorrect | Missing
1 | 0.328 | 0 | 0
2 | 0.28864 | 0 | 0.03936
3 | 0.28864 | 0 | 0.03936
4 | 0.24272 | 0.00656 | 0.07872
5 | 0.1968 | 0 | 0.1312
6 | 0.17056 | 0 | 0.15744
7 | 0.11808 | 0 | 0.20992

For this combined measure of performance (see main manuscript), the divergent mode performs vastly better at all taxonomic levels, and hence we would select this database for classification.
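The arithmetic in steps 7 and 8 can be reproduced with a short Python sketch instead of a spreadsheet. The evaluation values and sequence counts below are copied from this guide. The function names are illustrative, and the inclusion proportion is rounded to three decimals only to mirror the 0.328 used above.

```python
# Self-evaluation results copied from the tables in this guide.
# Each row is (taxlevel, correct, incorrect, missing).
DIVERGENT = [
    (1, 1.0, 0.0, 0.0),
    (2, 0.87222222, 0.01666667, 0.11111111),
    (3, 0.86111111, 0.01111111, 0.12777778),
    (4, 0.6, 0.01666667, 0.38333333),
    (5, 0.53333333, 0.0, 0.46666667),
    (6, 0.46666667, 0.01666667, 0.51666667),
    (7, 0.31111111, 0.02777778, 0.66111111),
]
CONSERVED = [
    (1, 1.0, 0.0, 0.0),
    (2, 0.88, 0.0, 0.12),
    (3, 0.88, 0.0, 0.12),
    (4, 0.74, 0.02, 0.24),
    (5, 0.6, 0.0, 0.4),
    (6, 0.52, 0.0, 0.48),
    (7, 0.36, 0.0, 0.64),
]


def included_fraction(n_in_database, n_input):
    """Proportion of input sequences kept, rounded as in the guide."""
    return round(n_in_database / n_input, 3)


def weight(rows, fraction):
    """Scale every evaluation proportion by the inclusion proportion."""
    return [(lvl, c * fraction, i * fraction, m * fraction)
            for lvl, c, i, m in rows]


# Sequence counts from step 7: 180 input, 180 divergent, 59 conserved.
div_weighted = weight(DIVERGENT, included_fraction(180, 180))
con_weighted = weight(CONSERVED, included_fraction(59, 180))

for (lvl, div_c, _, _), (_, con_c, _, _) in zip(div_weighted, con_weighted):
    best = "divergent" if div_c > con_c else "conserved"
    print(f"level {lvl}: divergent {div_c:.3f}, conserved {con_c:.3f} -> {best}")
```

Comparing only the weighted "correct" column is a simplification. The weighted incorrect proportions may also be worth inspecting when two modes score similarly.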

9) To use the database, put it in the “metaxa2_db” directory, which is located in the same directory as the Metaxa2 software. Then run Metaxa2 with the -g option, like this:

metaxa2 -i sequences_to_classify.fasta -g rpb2_DIV -o metaxa2_output --cpu 8

Note that the “-g” option must be set to the name of the database directory, not the actual gene. However, you can of course rename that directory “rpb2” before adding it to the metaxa2_db directory. Aside from the fact that the “-g” option must be specified, Metaxa2 will work as usual for the new database.