Metaxa2 Database Builder: Enabling taxonomic identification from metagenomic and metabarcoding data using any genetic marker

Johan Bengtsson-Palme*, Rodney T. Richardson, Marco Meola, Christian Wurzbacher, Émilie D. Tremblay, Kaisa Thorell, Kärt Kanger, K. Martin Eriksson, Guillaume J. Bilodeau, Reed M. Johnson, Martin Hartmann, R. Henrik Nilsson

* Address correspondence to [email protected]; +46 31 342 46 26; Department of Infectious Diseases, Institute of Biomedicine, The Sahlgrenska Academy, University of Gothenburg, Guldhedsgatan 10, SE-413 46, Gothenburg, Sweden; Center for Antibiotic Resistance research (CARe) at University of Gothenburg, Box 440, SE-40530, Gothenburg, Sweden; Wisconsin Institute for Discovery, University of Wisconsin-Madison, 330 N. Orchard Street, Madison, WI 53715, USA
Supplementary Material

Supplementary Figures S1-S5
Supplementary Table S1
Supplementary Text S1
Supplementary Item S1 is available as a separate file
[Figure: workflow diagram. Inputs: sequence data (FASTA format) and a taxonomy file (Metaxa2, FASTA, ASN or XML format). Operating modes: divergent (default), hybrid and conserved. Step labels: cluster sequences at 20% identity; identify representative sequence (user selected or USEARCH); align each cluster; for each cluster separately; extract full-length sequences; build two HMMs for each cluster; align sequences; extract matching input sequences; trim sequence ends; re-align sequences (MAFFT); find conserved regions; align conserved regions individually (MAFFT); build HMM for each conserved region (HMMER); filter taxonomy data (optional); cross-check sequences and taxonomy data; build BLAST database (BLAST); determine identity thresholds for taxonomic levels (unless supplied by the user); evaluate database performance (optional). Tools named: USEARCH or VSEARCH, MAFFT, HMMER, BLAST, Metaxa2. Outputs: final Metaxa2 database including HMMs, a BLAST database, and taxonomic information; evaluation output.]
Supplementary Fig. S1. The Metaxa2 Database Builder workflow. All steps running USEARCH can also be performed with VSEARCH, depending on which of the software packages is detected by the database builder
[Figure: six panels, A-F, titled by operating mode (conserved, divergent, hybrid). Proportion axis: 0% to 100%. Taxonomic levels shown: Domain, Phylum, Class, Order, Family, Genus. Barcoding regions shown: 16S rRNA, cpn60, EF1alpha, ITS2, matK, rbcL, rpb1, rpb2, trnH, trnL.]
Supplementary Fig. S2. Proportion of successfully detected and correctly assigned sequences at different taxonomic levels for each tested DNA barcoding region for databases built in the conserved (a), divergent (b) and hybrid (c) modes. Proportion of incorrectly assigned sequences at different taxonomic levels for databases built in the conserved (d), divergent (e) and hybrid (f) modes. Proportions were calculated based on the internal self-evaluation procedure of the Metaxa2 Database Builder.
[Figure: proportion of input sequences included in database (axis from 0% to 100%), shown for the conserved, divergent and hybrid modes.]
Supplementary Fig. S3. Proportion of sequences in the input datasets that were included in the final classification databases for the different operating modes.
[Figure: five panels, A-E. X-axes: internal evaluation; y-axes: input evaluation. Axis labels recovered: proportion correct sequences; proportion incorrect sequences; proportion non-classified sequences; proportion sequences excluded and proportion overclassified sequences; included x correct proportion. Legend: conserved, divergent, hybrid.]
Supplementary Fig. S4. Comparison between the internal software evaluation (x-axes) and the corresponding evaluations of 150 nt sequence fragments (input evaluation; y-axes) for the three different operating modes. A) Correct proportions reported by the self-evaluation compared to the actual proportion of correct fragments from all input sequences, on the family level. B) Proportions of incorrect family assignments. C) Proportions of sequences not assigned at the family level. D) Proportions of sequences not included in the built databases compared to the overclassification rates for the fragments (i.e. sequence fragments belonging to a family not present in the final database, but still assigned to a family by Metaxa2). E) Proportion of correctly assigned sequences, multiplied with the proportion of sequences included in the final classification databases, compared to the actual proportion of correct fragments from all input sequences, on the family level.
Supplementary Fig. S5. Comparison of the Metaxa2 performance on the SSU rRNA sequence fragments used for the original software evaluation of Metaxa2 for the native Metaxa2 database (a), and databases constructed using the Metaxa2 Database Builder (b), based on SILVA release 111 (c) and SILVA release 128 (e) using sequence filtering, and without filtering (d and f). PE denotes paired sequence fragments, similar to paired-end reads. Sequence fragments were regarded as correctly classified if all taxonomic levels reported by Metaxa2 corresponded to the known taxonomy of the input sequence that the fragment was derived from. For ‘perfect’ entries, classifications corresponded exactly and entirely to the known taxonomic affiliation at every investigated taxonomic level. If any incorrect taxonomic assignments were made at any taxonomic level, the fragment was regarded as misclassified.
Supplementary Table S1

Barcoding region | Taxonomic coverage | Dataset description | Data source | Ref.
16S rRNA | Bacteria | Type strains and cultivated strains in SILVA release 128 | SILVA | 32
ATP9-NAD9 | Phytophthora species, from the phylum Oomycota | Phytophthora species curated sequences including 140 different species/hybrids (a total of 123 species are currently described http://www.phytophthoradb.org) | NCBI Nucleotide | 35
cpn60 | Lactobacilli | Group I sequences present in the cpnDB (http://www.cpndb.ca) | cpnDB | 34
EF1alpha | Fungi | Fungal sequences from all major branches | NCBI Nucleotide | 31
ITS2 | Vascular plants | Plant sequences, filtered to only contain sequences from plants found in Ohio | NCBI Nucleotide | 16
matK | Vascular plants | Plant sequences, filtered to only contain sequences from plants found in Ohio | NCBI Nucleotide | 16
rbcL | Vascular plants | Plant sequences, filtered to only contain sequences from plants found in Ohio | NCBI Nucleotide | 16
rpb1 | Fungi | Fungal sequences from all major branches | NCBI Nucleotide | 31
rpb2 | Fungi | Fungal sequences from all major branches | NCBI Nucleotide | 31
trnH | Vascular plants | Plant sequences, filtered to only contain sequences from plants found in Ohio | NCBI Nucleotide | 16
trnL | Vascular plants | Plant sequences, filtered to only contain sequences from plants found in Ohio | NCBI Nucleotide | 16
Supplementary Text S1. Step-by-step usage guide for the Metaxa2 Database Builder

This is a step-by-step guide to recreating the rpb2 database, intended to show readers how to use the Metaxa2 Database Builder. The guide assumes that you have Metaxa2 and all its dependencies (HMMER, BLAST, MAFFT, USEARCH) installed before starting. Instructions on installing these can be found in the software manual (Supplementary Item S1).

1) Obtain high-quality sequences for the genetic marker in question. For rpb2, we used reference sequences from James et al. 2006 (see supplementary table 1; https://media.nature.com/original/nature-assets/nature/journal/v443/n7113/extref/nature05110-s1.pdf) downloaded from NCBI GenBank in FASTA format.

2) For the same sequence set, obtain high-quality taxonomic information. For rpb2, we used the NCBI GenBank records corresponding to the sequences above, downloaded in GBK format.

3) If possible, manually curate the taxonomic data to remove obviously erroneous entries (sequences without taxonomic data will be removed automatically by the database builder). For the rpb2 data, we also converted the taxonomy information to the Metaxa2 native format (tab-separated, see manual) in this step, but this is not a requirement for software operation.

4) Run the Metaxa2 Database Builder using the following command:

    metaxa2_dbb -e rpb2.fasta -t rpb2_taxonomy_filtered.txt -o rpb2_DIV -g rpb2 --auto_rep T --cpu 4 --save_raw T --filter_uncultured F --correct_taxonomy T --evaluate T --mode divergent

This command will create a Metaxa2 database for rpb2 using the divergent (default) mode. The “--auto_rep T” option tells the database builder to automatically select an appropriate reference sequence in the input data. The “--save_raw T” option can be useful if an error occurs, as it saves all the intermediate files created in the database construction process. Note that we instruct the program to attempt to correct and standardize the taxonomic information (“--correct_taxonomy T”) but not to remove sequences associated with uncultured species (“--filter_uncultured F”). Depending on the input data, different options could be useful here. Finally, we also use the “--evaluate T” option to self-evaluate the database. This is important, as the result will be used later in this guide to select which mode produced the best database.

5) When the command has finished, run the program again, but this time using the conserved mode (note the changed output file name):

    metaxa2_dbb -e rpb2.fasta -t rpb2_taxonomy_filtered.txt -o rpb2_CON -g rpb2 --auto_rep T --cpu 4 --save_raw T --filter_uncultured F --correct_taxonomy T --evaluate T --mode conserved
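The two builder runs in steps 4 and 5 differ only in the output name and the “--mode” value, so they can be generated from one template. The following is a minimal sketch, not part of the Metaxa2 distribution: the function name dbb_command is hypothetical, and every flag is copied unchanged from the commands in this guide. It only prints the command lines, so nothing is executed until you run them yourself, one after the other.

```shell
#!/bin/sh
# dbb_command MODE TAG
# Print the metaxa2_dbb command line used in this guide for one operating mode.
# MODE is the value passed to --mode; TAG is the suffix of the output name.
# (Hypothetical helper; all flags are taken verbatim from steps 4 and 5.)
dbb_command() {
    mode=$1
    tag=$2
    printf 'metaxa2_dbb -e rpb2.fasta -t rpb2_taxonomy_filtered.txt -o rpb2_%s -g rpb2 --auto_rep T --cpu 4 --save_raw T --filter_uncultured F --correct_taxonomy T --evaluate T --mode %s\n' "$tag" "$mode"
}

# Print the two commands of steps 4 and 5, in the order they should be run.
dbb_command divergent DIV
dbb_command conserved CON
```

Each printed line can be pasted into a terminal once the previous run has finished.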
6) Since we now have self-evaluation results from both the conserved and divergent modes, we can compare the performances of the two methods to select the best database. To do so, find the information in the file “evaluation_statistics.0.1.txt” (inside the database directory) and copy it e.g. to an Excel spreadsheet. Do the same for both modes.

            Divergent                                 Conserved
Taxlevel    Correct      Incorrect    Missing         Correct    Incorrect    Missing
1           1            0            0               1          0            0
2           0.87222222   0.01666667   0.11111111      0.88       0            0.12
3           0.86111111   0.01111111   0.12777778      0.88       0            0.12
4           0.6          0.01666667   0.38333333      0.74       0.02         0.24
5           0.53333333   0            0.46666667      0.6        0            0.4
6           0.46666667   0.01666667   0.51666667      0.52       0            0.48
7           0.31111111   0.02777778   0.66111111      0.36       0            0.64
You should now have a table looking somewhat like the one above.

7) Next, find the number of input sequences and the number actually included in the final database. This can be done using the following commands:

    grep -c ">" rpb2.fasta
    grep -c ">" rpb2_DIV/blast.fasta
    grep -c ">" rpb2_CON/blast.fasta
In the rpb2 case, the numbers were 180 (input file), 180 (divergent mode database) and 59 (conserved mode database).

8) Multiply the numbers in the evaluation spreadsheet by the proportion of sequences included in each database mode (for rpb2 that would be 180/180 = 1.0 for the divergent mode and 59/180 ≈ 0.328 for the conserved mode). This results in the following table:

            Divergent                                 Conserved
Taxlevel    Correct      Incorrect    Missing         Correct    Incorrect    Missing
1           1            0            0               0.328      0            0
2           0.87222222   0.01666667   0.11111111      0.28864    0            0.03936
3           0.86111111   0.01111111   0.12777778      0.28864    0            0.03936
4           0.6          0.01666667   0.38333333      0.24272    0.00656      0.07872
5           0.53333333   0            0.46666667      0.1968     0            0.1312
6           0.46666667   0.01666667   0.51666667      0.17056    0            0.15744
7           0.31111111   0.02777778   0.66111111      0.11808    0            0.20992
For this combined measure of performance (see main manuscript), the divergent mode performs vastly better at all taxonomic levels, and hence we would select this database for classification.
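The arithmetic of steps 7 and 8 can also be scripted instead of being done in a spreadsheet. The sketch below makes several assumptions: the helper names (count_seqs, included_fraction, scale_eval) are hypothetical, and scale_eval expects rows of four whitespace-separated values (Taxlevel, Correct, Incorrect, Missing) on standard input, a layout chosen for illustration rather than taken from the actual “evaluation_statistics.0.1.txt” file. The included fraction is rounded to three decimals to match the value used in this guide.

```shell
#!/bin/sh
# count_seqs FILE
# Count FASTA records, i.e. lines that start with ">".
count_seqs() {
    grep -c '^>' "$1"
}

# included_fraction N_INPUT N_DATABASE
# Proportion of input sequences retained in the database, to three decimals.
included_fraction() {
    LC_ALL=C awk -v n_in="$1" -v n_db="$2" \
        'BEGIN { printf "%.3f\n", n_db / n_in }'
}

# scale_eval FRACTION
# Read "Taxlevel Correct Incorrect Missing" rows from standard input
# (assumed layout) and multiply the three proportions by FRACTION.
scale_eval() {
    LC_ALL=C awk -v f="$1" \
        '{ printf "%s\t%.5f\t%.5f\t%.5f\n", $1, $2 * f, $3 * f, $4 * f }'
}

# Example with the rpb2 numbers from this guide (not executed here):
#   n_in=$(count_seqs rpb2.fasta)
#   n_con=$(count_seqs rpb2_CON/blast.fasta)
#   frac=$(included_fraction "$n_in" "$n_con")
#   printf '4 0.74 0.02 0.24\n' | scale_eval "$frac"
```

Rounding the fraction before multiplying reproduces the conserved-mode values in the table above; using the unrounded ratio would change the later decimals slightly.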
9) To use the database, place the database directory in the “metaxa2_db” directory, which is located in the same directory as the Metaxa2 software. Then run Metaxa2 with the -g option, like this:

    metaxa2 -i sequences_to_classify.fasta -g rpb2_DIV -o metaxa2_output --cpu 8

Note that the “-g” option must be set to the name of the database directory, not the actual gene. However, you can of course rename that directory to “rpb2” before adding it to the metaxa2_db directory. Aside from the fact that the “-g” option must be specified, Metaxa2 will work as usual with the new database.
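Copying and renaming the database directory can be wrapped in a small helper. This is a sketch under stated assumptions: install_db is a hypothetical function, and the Metaxa2 installation directory is passed in by the caller because its location differs between systems. The helper refuses to overwrite an existing database of the same name.

```shell
#!/bin/sh
# install_db BUILT_DIR METAXA2_DIR NAME
# Copy a database built by metaxa2_dbb into METAXA2_DIR/metaxa2_db/NAME,
# so that it can be selected with "metaxa2 -g NAME".
# (Hypothetical helper; METAXA2_DIR is wherever the Metaxa2 software lives.)
install_db() {
    built_dir=$1
    metaxa2_dir=$2
    name=$3
    target="$metaxa2_dir/metaxa2_db/$name"

    # Do not overwrite a database that is already installed under this name.
    if [ -e "$target" ]; then
        echo "install_db: $target already exists" >&2
        return 1
    fi

    mkdir -p "$metaxa2_dir/metaxa2_db" && cp -R "$built_dir" "$target"
}

# Example (paths are placeholders, not executed here):
#   install_db rpb2_DIV /path/to/metaxa2 rpb2
#   metaxa2 -i sequences_to_classify.fasta -g rpb2 -o metaxa2_output --cpu 8
```

Installing under the name “rpb2” means the “-g” value matches the gene name, as suggested in the note above.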