Genes Genom (2014) 36:191–196 DOI 10.1007/s13258-013-0155-8
RESEARCH ARTICLE
A clustering method for next-generation sequences of bacterial genomes through multiomics data mapping
Ho-Sik Seok • Mikang Sim • Daehwan Lee • Jaebum Kim
Received: 2 September 2013 / Accepted: 24 October 2013 / Published online: 6 November 2013 © The Genetics Society of Korea 2013
Abstract
With various 'omics' data recently becoming available, new challenges and opportunities have arisen for research on the assembly of next-generation sequences. To exploit these opportunities, we developed a next-generation sequence clustering method that focuses on the interdependency between genomics and proteomics data. Under the assumption that we can obtain next-generation read sequences and proteomics data of a target species, we mapped the read sequences against protein sequences and found physically adjacent reads based on a machine learning-based read assignment method. We measured the performance of our method by using simulated read sequences and collected protein sequences of Escherichia coli (E. coli). Here, we concentrated on the actual adjacency of the clustered reads in the E. coli genome and found that (i) the proposed method improves the performance of read clustering and (ii) the use of proteomics data has the potential to enhance the performance of genome assemblers. These results demonstrate that the integrative approach is effective for the accurate grouping of adjacent reads in a genome, which should result in a better genome assembly.
Electronic supplementary material The online version of this article (doi:10.1007/s13258-013-0155-8) contains supplementary material, which is available to authorized users.
H.-S. Seok • M. Sim • D. Lee • J. Kim
Department of Animal Biotechnology, Konkuk University, Seoul 143-701, Korea
J. Kim (corresponding author)
Department of Animal Biotechnology, UBITA Center for Biotechnology Research (CBRU), Konkuk University, Seoul 143-701, Korea
e-mail:
[email protected]
Keywords Next-generation sequence assembly • Escherichia coli • Multiomics data • Read clustering
Introduction
Current next-generation sequencing (NGS) technologies generate a huge number of short read sequences, which must be assembled to recover the original genome sequences (Henson et al. 2012). However, the short length and low accuracy of NGS read sequences make de novo genome assembly very challenging (Shendure and Ji 2008). De novo genome assembly algorithms typically use the overlap information among NGS read sequences and generate contigs (or scaffolds), which are still smaller than the original genome sequences, as their final products. These small genomic fragments can be further merged based on alignments against genome sequences of a closely related reference species. However, this approach cannot be used if no good reference exists, and the alignment of genome assemblies is complicated by genomic rearrangements (Didelot et al. 2012). Here, we present a simple NGS read clustering method for bacterial genomes as a way to circumvent these difficulties. The clustered NGS reads complement the NGS read overlap information and can thus be used to increase the accuracy of a genome assembly. Compared with eukaryotic genomes, bacterial genomes are much smaller (2–6 Mb) and are usually haploid (Didelot et al. 2012). Furthermore, bacterial genomes are mainly composed of protein-coding regions, so they may be good targets for testing new genome assembly ideas that utilize the primary structures of proteins. Our method aims to construct more accurate groups of
neighboring NGS reads by taking advantage of protein sequences, and it can also enlarge the set of available protein sequences by utilizing protein–protein interactions (PPIs). Because every three nucleotides in a coding DNA sequence encode one amino acid, it is possible to treat protein sequences as a portion of the reference sequences by mapping NGS reads against them, an idea similar to the reference-based strategy for transcriptomes (Martin and Wang 2011). Here we assumed that it is possible to collect at least a small number of protein sequences of the target species by using experimental methods, such as protein mass spectrometry (Steen and Mann 2004; Lubeck et al. 2002), or by searching for orthologous proteins in a closely related species (Waterhouse et al. 2013; Park et al. 2011). Because it is highly likely that we cannot collect all protein sequences of the target species, we enlarged the set of protein sequences by collecting relevant proteins from public PPI databases (Goll et al. 2008; Razick et al. 2008). To evaluate our method, we selected Escherichia coli (E. coli) as our target species. By using simulated NGS reads of E. coli, we constructed clusters of the NGS reads and measured their accuracy. Owing to its short generation time and rather simple genome structure, E. coli has been studied extensively, which has resulted in the accumulation of a huge volume of data (Blattner et al. 1997). Through the experiments, we found that (i) our method improves the performance of grouping NGS reads and (ii) the use of proteomics data has the potential to enhance the accuracy of a genome assembly, suggesting that our method, although simple, has great potential to contribute to highly reliable bacterial genome assemblies.
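The idea of treating protein sequences as a reference rests on translating a read in all six reading frames, which is what a translated search such as BLASTX does before comparing the query with proteins. The following is a minimal sketch of that translation step only, assuming the standard genetic code and reads containing only A, C, G, and T; the function names are hypothetical and are not part of the authors' pipeline.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq: str) -> str:
    """Translate complete codons only; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translations(read: str) -> dict:
    """Translate a read in the three forward and three reverse frames."""
    read = read.upper()
    frames = {}
    for strand, seq in (("+", read), ("-", reverse_complement(read))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


if __name__ == "__main__":
    for frame, peptide in six_frame_translations("ATGGCCTAA").items():
        print(frame, peptide)
```

Each of the six peptides can then be compared with the collected protein sequences; a read that lies in a coding region is expected to match in at most one frame.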
Materials and methods
Data
Our ultimate goal is to assemble the genome sequences of a target species. Before attempting whole genome sequence assembly, we aim to build clusters of potentially adjacent NGS reads. Our approach differs from others in that we take advantage of both NGS read sequences and protein sequences related to the target species. We evaluated our method by using E. coli as our target species. We collected a genome sequence of E. coli K12 (substrain MG1655 U00096) from the National Center for Biotechnology Information (NCBI) database (Blattner et al. 1997) and simulated 13,918,800 reads (coverage 300×) using the program ART (Huang et al. 2012). From the initially simulated reads, we randomly selected 1,393,205 (~10 %) reads and used them as a training data
set for intelligent read assignment rules (see the last subsection of Materials and methods). Among ~4,000 protein sequences of E. coli in the Database of Interacting Proteins (Salwinski et al. 2004), 200 protein sequences were randomly selected and used as our initial protein set. Here we tried to mimic the case where we can only obtain a small portion of the protein sequences of the target species. In order to enlarge the available protein information, we added further proteins by using the PPI database (http://www.bacteriome.org). Specifically, we searched for interacting proteins of the randomly sampled 200 proteins, and found 276 additional proteins. To show the performance of our method with different amounts of protein sequences, similar evaluations were repeated by using 100, 500, 1,000, and 2,000 randomly chosen protein sequences with 156, 427, 514, and 617 additional protein sequences found from the PPI database, respectively.
Algorithm
Typical genome sequence assemblers try to concatenate NGS reads based on overlap-layout-consensus or De Bruijn graph strategies (Nagarajan and Pop 2013), which rely on genomic information only. However, recent publications of various 'omics' data have provided novel opportunities for functional analyses of genomes (Hawkins et al. 2010) as well as for the assembly of genome sequences. In the case of genome assembly, the relationships between genomes and other related 'omics' data could be used as novel guidance for the assembly of NGS read sequences. In particular, protein–protein and other regulatory interaction network data, in combination with mapping between DNA and protein sequences, may provide virtual scaffolding information for regions of interest in a genome (Berger et al. 2013). Specifically, typical genome assembly approaches rely only on a set of reads R and read overlap information o.
However, we introduce new entities q and s that represent a protein sequence and the mapping information between NGS reads and a protein sequence, respectively. Here we assumed that it is possible to collect protein sequences for a target species as described in the previous section. The mapping of NGS read sequences against protein sequences was done with the BLASTX program, which searches a protein database using a given DNA query (Altschul et al. 1990). As mapping results, BLASTX produces a number of outputs, which are collectively represented as s above and can be used as a determinant for the assignment of an NGS read to a protein. We provide more details on how to use s as the determinant in the last subsection of Materials and methods. By using the above four random variables R, o, q, and s, and one additional variable m for an NGS read, we can construct a conditional probability $P(m \mid R, q, o, s)$, which represents
the likelihood that m is assigned to the protein q. Because o and s are obtained by processing a given (fixed) set of reads (R) and a protein (q), respectively, we simply represented the above conditional probability as:

$$P(m \mid R, q, o, s) = P(m \mid o, s) \tag{1}$$

To further simplify Eq. 1 and to focus more on the read mapping on a protein sequence, which has not been well considered by existing genome assembly algorithms, we assumed a conditional independence of o and s given R and q, and obtained the following Eq. 2.

$$P(m \mid o, s) = P(m \mid o)\,P(m \mid s) \tag{2}$$

Equation 2 is a very primitive form of integrating genomics and proteomics data. Because we did not aim at a rigorous statistical integration of the genomic and proteomic information, a machine learning-based inference technique was used for the probability $P(m \mid s)$ (see the last subsection of Materials and methods), and state-of-the-art statistical techniques were not used in our method (Fagan et al. 2007; Kohl et al. 2013). Here we note that the conditional independence assumption may not hold for all NGS reads because NGS reads mapped to the same protein sequence are more likely to overlap each other. However, we used that assumption in part because of our focus on handling adjacent NGS reads without overlap, and in part because it simplifies Eq. 1. The applicability of the assumption was empirically supported by our evaluation. Table 1 describes the overall read assignment procedure for the probability $P(m \mid s)$ in Eq. 2. We first collected relevant genome sequences, protein sequences, and PPI data of a target species. Then we randomly selected a set of (~10 %) NGS reads and protein sequences to use them as a training data set for a machine learning method for assigning reads to proteins (see the last subsection of Materials and methods). Based on the PPI database, we enlarged the protein data by incorporating interacting proteins. Finally, we aligned and assigned NGS reads to protein sequences by BLASTX and the abovementioned read assignment rules. At the end, we have read clusters, which are highly likely to be in the same contigs of a genome assembly output.

Table 1 Overall read assignment procedure
Step 1  Data preparation: Collecting relevant genome sequences, protein sequences, and PPI data of a target species
Step 2  Read simulation: Simulating a set of NGS reads
Step 3  Read and protein sequence sampling: Randomly sampling sets of read and protein sequences
Step 4  Protein data enlargement: Adding more protein sequences by selecting relevant proteins, which interact with the proteins selected in step 3
Step 5  Dynamic read selection: From the partial set of reads, finding read-selection rules based on a machine learning method
Step 6  Alignment: Aligning and assigning reads to protein sequences based on the selection rules from step 5

Machine learning-based read assignment rules
The core idea of our method is to find clusters of NGS reads that should be physically located in the same contigs in the output of a genome assembly method. We try to improve the accuracy of the read clustering by using not only the overlap information among the reads themselves but also the relationship between reads and proteins obtained from the alignment of read sequences to protein sequences by BLASTX, which provides various clues for assigning a read to a protein (Table 2). We did not use the clues in Table 2 just as they are but developed a novel machine learning method based on a decision tree (Mitchell 1997), which can tell us whether a particular read is placed in the region of a particular protein or not. Because of its readability, a decision tree method is widely used for extracting rules from a huge volume of data. Our decision tree was produced by a data-mining tool, WEKA (Hall et al. 2009), by using the training data set of read and protein sequences described in the previous subsection. Because we knew the position of each read and each protein in a genome sequence from the simulation and the protein annotation information, respectively, we were able to separate the correctly aligned reads from the wrongly aligned ones, and to generate similar numbers of positive and negative instances for training (the ratios of positive instances to negative instances were 1:1.05, 1:1.05, 1:0.98, 1:0.98, and 1:1.06 for the data sets with 100, 200, 500, 1,000, and 2,000 initially chosen proteins, respectively). Based on these training data sets, WEKA produced prediction rules for read assignment (see Table 3 for example rules for the data set with 200 initially chosen proteins). The data-mined prediction rules provide hints about the relative importance of the attributes used (BLASTX outputs in our case). The attributes "Expect" and "Identities" were observed in all of the produced prediction rules, indicating that these two attributes are more important than the others.
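To illustrate how such data-mined rules can be applied to a BLASTX hit, the sketch below encodes four of the fifteen rules of Table 3 (Rules 1, 2, 4, and 11) as Python predicates. It is an illustration only and not the authors' implementation: the `BlastxHit` record, its field names, and the choice to treat "identities" as a percentage are assumptions, and the remaining rules are omitted for brevity.

```python
from dataclasses import dataclass


@dataclass
class BlastxHit:
    """Hypothetical container for the BLASTX attributes used by the rules."""
    expect: float            # e-value of the alignment
    identities: float        # assumed to be a percentage (0-100)
    score: float
    s_start: int             # mapping start position in the protein
    mismatches: int
    alignment_length: int


def rule_1(h: BlastxHit) -> bool:
    return (h.expect <= 0.026 and h.identities <= 65.62
            and 11 < h.mismatches <= 12 and h.alignment_length > 31)


def rule_2(h: BlastxHit) -> bool:
    return h.expect <= 0.026 and h.identities > 65.62


def rule_4(h: BlastxHit) -> bool:
    return h.expect > 0.026 and h.s_start <= 1 and h.identities > 70.59


def rule_11(h: BlastxHit) -> bool:
    return (0.026 < h.expect <= 0.26 and h.s_start > 1
            and h.score > 20.4 and h.identities > 78.12)


# Only a subset of Table 3; the other rules would be added in the same way.
RULES = (rule_1, rule_2, rule_4, rule_11)


def assign_read(hit: BlastxHit) -> bool:
    """Return True if any encoded rule maps the read to the protein."""
    return any(rule(hit) for rule in RULES)


if __name__ == "__main__":
    strong = BlastxHit(expect=1e-5, identities=90.0, score=50.0,
                       s_start=10, mismatches=3, alignment_length=33)
    weak = BlastxHit(expect=5.0, identities=50.0, score=15.0,
                     s_start=20, mismatches=5, alignment_length=10)
    print(assign_read(strong), assign_read(weak))
```

Unlike a single e-value cutoff, the rules can accept a hit with a relatively large e-value when other attributes, such as identity and alignment position, support the mapping.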
Table 2 BLASTX alignment outputs
Score (S): A normalized score of the alignment between the query and the subject sequences, taking into account the weighted values of the number of identities/mismatches/gaps and the total length of the gaps. The higher the score, the better the alignment
Expect: The e-value, i.e., the expected number of alignments with an equal or better score that would occur by chance
Identities: Number of identical residues / total number of residues in the alignment
Mismatches: Number of mismatched residues
Positives: (number of identical residues + number of residues considered to be conserved) / total number of residues
Gaps: The number of amino acids in the gap regions, shown if gaps were introduced
Alignment length: Alignment length
q.start: Mapping start position in a read
q.end: Mapping end position in a read
s.start: Mapping start position in a protein
s.end: Mapping end position in a protein

Table 3 Read assignment rules used in a decision tree for the data set with 200 initially chosen protein sequences
Rule 1: If (expect ≤ 0.026) & (identities ≤ 65.62) & (11 < mismatches ≤ 12) & (alignment length > 31) → read mapping
Rule 2: If (expect ≤ 0.026) & (identities > 65.62) → read mapping
Rule 3: If (expect > 0.026) & (s.start ≤ 1) & (42.42 < identities ≤ 70.59) & (q.end > 96) → read mapping
Rule 4: If (expect > 0.026) & (s.start ≤ 1) & (identities > 70.59) → read mapping
Rule 5: If (0.026 < expect ≤ 0.15) & (s.start > 1) & (64.71 < identities ≤ 75) & (q.start ≤ 88) & (q.end > 97) → read mapping
Rule 6: If (0.026 < expect ≤ 0.15) & (s.start > 1) & (64.71 < identities ≤ 75) & (q.start > 88) → read mapping
Rule 7: If (expect > 0.026) & (s.start > 1) & (identities > 86.36) & (score ≤ 20.4) & (q.start ≤ 3) & (alignment length > 6) → read mapping
Rule 8: If (expect > 0.026) & (s.start > 1) & (identities > 86.36) & (score ≤ 20.4) & (alignment length ≤ 9) & (97 < q.start ≤ 99) → read mapping
Rule 9: If (expect > 0.026) & (s.start > 1) & (identities > 75) & (score ≤ 20.4) & (q.start > 97) & (alignment length > 9) → read mapping
Rule 10: If (0.026 < expect ≤ 0.26) & (s.start > 1) & (score > 20.4) & (75 < identities ≤ 78.12) & (q.start > 76) → read mapping
Rule 11: If (0.026 < expect ≤ 0.26) & (s.start > 1) & (score > 20.4) & (identities > 78.12) → read mapping
Rule 12: If (expect > 0.26) & (s.start > 1) & (identities > 75) & (alignment length ≤ 11) & (20.4 < score ≤ 20.8) & (s.start > 46) → read mapping
Rule 13: If (expect > 0.37) & (s.start > 1) & (identities > 75) & (alignment length ≤ 11) & (score > 21.2) → read mapping
Rule 14: If (expect > 0.26) & (s.start > 1) & (identities > 75) & (score > 20.4) & (alignment length > 11) → read mapping
Rule 15: If (expect > 0.026) & (s.start > 1) & (score ≤ 20.4) & (q.start ≤ 3) & (75 < identities ≤ 86.36) & (s.end > 357) → read mapping

Results and discussion
In a genome assembly, the prime interest is the question of how believable a final assembly is (assembly quality). We evaluated our method in terms of improving the assembly quality by examining how accurately NGS reads are assigned to their original positions in a genome sequence. Because we knew all the necessary information about E. coli, we were able to count the true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN) of the assignment of reads to their original positions. We used three evaluation measures: sensitivity ($\mathrm{TP}/(\mathrm{TP}+\mathrm{FN})$), specificity ($\mathrm{TN}/(\mathrm{FP}+\mathrm{TN})$), and accuracy ($(\mathrm{TP}+\mathrm{TN})/(\mathrm{TP}+\mathrm{TN}+\mathrm{FP}+\mathrm{FN})$). In addition, we compared our method with the case where only the e-value of the BLASTX mapping is used for the read assignment, which is a typical usage of the BLAST tools.
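The three evaluation measures can be computed directly from the four confusion counts. The short sketch below, with a hypothetical function name, uses the counts reported for "Our method" in Table 4 as its example input.

```python
def classification_metrics(tp: int, tn: int, fp: int, fn: int):
    """Return (sensitivity, specificity, accuracy) from confusion counts."""
    sensitivity = tp / (tp + fn)                 # fraction of true mappings recovered
    specificity = tn / (tn + fp)                 # fraction of wrong mappings rejected
    accuracy = (tp + tn) / (tp + tn + fp + fn)   # fraction of all decisions correct
    return sensitivity, specificity, accuracy


if __name__ == "__main__":
    # Counts from the "Our method" row of Table 4.
    sens, spec, acc = classification_metrics(tp=51_622, tn=7_654_173,
                                             fp=87_717, fn=994)
    print(f"sensitivity={sens:.4f} specificity={spec:.4f} accuracy={acc:.4f}")
```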
In the case of the data set with 200 initially chosen protein sequences (see Materials and methods), the specificity increased as the e-value cutoff decreased, and it exceeded 0.9 at an e-value cutoff of 0.1 (Fig. 1; Table 4). However, the sensitivity began to decrease at the e-value cutoff of 0.1 and dropped to almost zero as the e-value cutoff decreased to 1e-20. We also found a strong negative correlation between the accuracy and the e-value cutoff, and obtained an accuracy above 0.9 with e-value cutoffs less than or equal to 0.1 (Fig. 1). However, the high accuracy at very small e-value cutoffs mainly reflects the large excess of negative instances (Table 4): rejecting almost every alignment still classifies most instances correctly, which is why it coincides with very low sensitivity values. In contrast to the method based only on the e-value cutoff, our method achieved very high performance in terms of all three measures (sensitivity 0.98, specificity 0.99, and accuracy 0.99; Fig. 1; Table 4). The e-value-only method came closest with e-value cutoffs of 0.1 (sensitivity 0.97, specificity 0.97, and accuracy 0.97) and 0.01 (sensitivity 0.95, specificity 0.98, and accuracy 0.98). However, the overall accuracy of our method was superior to both (the best accuracy combined with very high sensitivity and specificity). Similar trends were also observed for the data sets with 100, 500, 1,000, and 2,000 initially chosen protein sequences (Supplementary Fig. S1; Table S1). This indicates that our method improves the assignment of NGS reads to a protein, which can lead to a better genome assembly.

Fig. 1 Comparison of read clustering performance by using the data set with 200 initially chosen protein sequences. Given read and protein sequences whose locations in a genome sequence are known, we aligned read sequences against protein sequences by using BLASTX and assigned reads to a particular protein by both six different e-value cutoffs and our machine learning-based method. The performance was compared by using three measures: sensitivity, specificity, and accuracy

Table 4 Details of read clustering performance shown in Fig. 1
E-value cutoff  TP      TN         FP         FN      Sensitivity  Specificity  Accuracy
Our method      51,622  7,654,173  87,717     994     0.9811       0.9887       0.9886
1e-20           15      7,741,875  15         52,601  0.0003       1.0000       0.9932
1e-10           42,467  7,687,677  54,213     10,149  0.8071       0.9930       0.9917
1e-5            47,678  7,661,492  80,398     4,938   0.9062       0.9896       0.9891
1e-2            50,175  7,613,972  127,918    2,441   0.9536       0.9835       0.9833
1e-1            50,880  7,528,952  212,938    1,736   0.9670       0.9725       0.9725
1               51,702  6,542,866  1,199,024  914     0.9826       0.8451       0.8461
TP (true positive) = a positive instance is correctly predicted, TN (true negative) = a negative instance is correctly predicted, FP (false positive) = a negative instance is incorrectly predicted, FN (false negative) = a positive instance is incorrectly predicted, Sensitivity = $\mathrm{TP}/(\mathrm{TP}+\mathrm{FN})$, Specificity = $\mathrm{TN}/(\mathrm{TN}+\mathrm{FP})$, Accuracy = $(\mathrm{TP}+\mathrm{TN})/(\mathrm{TP}+\mathrm{TN}+\mathrm{FP}+\mathrm{FN})$
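The high accuracy obtained with very strict e-value cutoffs can be checked against the counts in Table 4. The sketch below compares the reported accuracy at the cutoff 1e-20 with the accuracy of a hypothetical classifier that assigns no read at all; the helper names are illustrative, and only the counts come from Table 4.

```python
# Counts taken from Table 4 (rows "Our method" and e-value cutoff 1e-20).
TABLE_4 = {
    "our_method": {"tp": 51_622, "tn": 7_654_173, "fp": 87_717, "fn": 994},
    "evalue_1e-20": {"tp": 15, "tn": 7_741_875, "fp": 15, "fn": 52_601},
}


def accuracy(counts: dict) -> float:
    """Fraction of all instances that were classified correctly."""
    return (counts["tp"] + counts["tn"]) / sum(counts.values())


def reject_all_accuracy(counts: dict) -> float:
    """Accuracy of a hypothetical classifier that rejects every alignment."""
    negatives = counts["tn"] + counts["fp"]
    return negatives / sum(counts.values())


if __name__ == "__main__":
    for name, counts in TABLE_4.items():
        print(name,
              f"accuracy={accuracy(counts):.4f}",
              f"reject-all baseline={reject_all_accuracy(counts):.4f}")
```

Because negative instances make up more than 99 % of the data set, accuracy alone cannot separate a useful assignment method from one that assigns almost nothing, which is why sensitivity has to be read alongside it.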
In summary, we developed a method to generate clusters of physically adjacent NGS reads in a genome by considering DNA–protein sequence mapping. Using E. coli as our target species, we evaluated our method with simulated NGS reads and collected protein sequences, and found that it has great potential for generating highly accurate read clusters, which can be used to construct an accurate genome assembly. Most existing de novo genome assemblers are based solely on read overlap information and require high-coverage read sequences as input to obtain an assembly of reasonable quality. However, read overlap information may not be enough for generating a high-quality genome assembly, and moreover different genome assemblers differ in their ability to use this information. We believe that this inconsistency may be compensated for by additional omics data as used in our method. The basic assumption of our method is the availability of protein sequences of a target species. Recent improvements in proteomics technologies have enabled us to sequence proteins of interest (Ziady and Kinter 2009). Once we have at least a few protein sequences, we can collect more protein sequences by taking advantage of PPI databases. An even worse case is one in which we cannot generate any protein sequences of our target species. We can circumvent this case by using the sequences of orthologous proteins in a closely related species. One may argue that using genome sequences of a closely related species may be even better. However, the use of proteins, which are encoded by highly conserved genomic regions, may have similar or better performance. Although our method is limited to prokaryotes, which have a relatively large portion of coding DNA sequences (CDSs), it provides valuable insight for other fields of research.
First of all, our method can be extended and used for producing highly reliable clusters of adjacent reads and these clusters could be utilized to improve the output of genome assemblers for species other than
prokaryotes. In addition, the machine learning-based approach for predicting the best rules for read assignment can be applied to any application that is based on the BLAST search. This was supported by the evaluations, which showed better results than the approach relying on just the e-value cutoff. In the near future, we will expand our results in two directions. First, we will develop a novel genome assembler that takes advantage of the clustered reads. Second, we will incorporate more omics data that provide insights into the relationships among reads. With prokaryotes, CDSs alone can explain a large portion of genomes. However, we need more information if we want to tackle more complex species.
Acknowledgments
This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Science, ICT and Future Planning (2013R1A1A2008183).
Conflict of interest
None.
References
Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ (1990) Basic local alignment search tool. J Mol Biol 215:403–410
Berger B, Peng J, Singh M (2013) Computational solutions for omics data. Nat Rev Genet 14:333–346
Blattner FR, Plunkett G III, Bloch CA, Perna NT, Burland V, Riley M, Collado-Vides J, Glasner JD, Rode CK, Mayhew GF et al (1997) The complete genome sequence of Escherichia coli K-12. Science 277:1453–1474
Didelot X, Bowden R, Wilson DJ, Peto TEA, Crook DW (2012) Transforming clinical microbiology with bacterial genome sequencing. Nat Rev Genet 13:601–612
Fagan A, Culhane AC, Higgins DG (2007) A multivariate analysis approach to the integration of proteomics and gene expression data. Proteomics 7:2162–2171
Goll J, Rajagopala SV, Shiau SC, Wu H, Lamb BT, Uetz P (2008) MPIDB: the microbial protein interaction database. Bioinformatics 24:1743–1744
Hall M, Frank E, Holmes G, Pfahringer B, Reutemann P, Witten IH (2009) The WEKA data mining software: an update. SIGKDD Explor Newsl 11(1):10–18
Hawkins RD, Hon CC, Ren B (2010) Next-generation genomics: an integrative approach. Nat Rev Genet 11:476–486
Henson J, Tischler G, Ning Z (2012) Next-generation sequencing and large genome assemblies. Pharmacogenomics 13:901–915
Huang W, Li L, Myers JR, Marth GT (2012) ART: a next-generation sequencing read simulator. Bioinformatics 28:593–594
Kohl M, Megger DA, Trippler M, Meckel H, Ahrens M, Bracht T, Weber F, Hoffmann AC, Baba HA, Sitek B et al (2013) A practical data processing workflow for multi-omics projects. Biochim Biophys Acta 13:S1570. doi:10.1016/j.bbapap.2013.02.029
Lubeck O, Sewell C, Gu S, Chen X, Cai DM (2002) New computational approaches for de novo peptide sequencing from MS/MS experiments. Proc IEEE 90:1868–1874
Martin JA, Wang Z (2011) Next-generation transcriptome assembly. Nat Rev Genet 12:671–682
Mitchell TM (1997) Machine learning, 1st edn. McGraw-Hill Inc, New York
Nagarajan N, Pop M (2013) Sequence assembly demystified. Nat Rev Genet 14:157–166
Park D, Singh R, Baym M, Liao C-S, Berger B (2011) IsoBase: a database of functionally related proteins across PPI networks. Nucl Acids Res 39:D295–D300
Razick S, Magklaras G, Donaldson IM (2008) iRefIndex: a consolidated protein interaction database with provenance. BMC Bioinformatics 9:405
Salwinski L, Miller CS, Smith AJ, Pettit FK, Bowie JU, Eisenberg D (2004) The database of interacting proteins: 2004 update. Nucl Acids Res 32(suppl 1):D449–D451
Shendure J, Ji H (2008) Next-generation DNA sequencing. Nat Biotechnol 26:1135–1145
Steen H, Mann M (2004) The ABC's (and XYZ's) of peptide sequencing. Nat Rev Mol Cell Biol 5:699–711
Waterhouse RM, Zdobnov EM, Tegenfeldt F, Li J, Kriventseva EV (2013) OrthoDB: a hierarchical catalog of animal, fungal and bacterial orthologs. Nucl Acids Res 41:D358–D365
Ziady AG, Kinter M (2009) Protein sequencing with tandem mass spectrometry. Methods Mol Biol 544:325–341