
RESEARCH REPORT

Organism identification using a genome sequence-independent universal microarray probe set

Yuri Y. Belosludtsev1, Dawn Bowerman2, Ryan Weil2, Nishanth Marthandan2, Robert Balog2, Kevin Luebke2, Jonathan Lawson2, Stephen A. Johnston2, C. Rick Lyons3, Kevin O’Brien1, Harold R. Garner2, and Thomas F. Powdrill1

BioTechniques 37:654-660 (October 2004)

Interest in developing biosensor technologies for identifying pathogens has been increasing, particularly in the biothreat area. In this study, a universal set of short 12- and 13-mer oligonucleotide probes was derived independently of a priori genomic sequence information and used to generate unique species-dependent genomic hybridization signatures. The probe set sequences were algorithmically generated to be maximally distant in sequence space and not dependent on the sequence of any particular genome. The probe set is universally applicable because it is unbiased and independent of hybridization predictions based upon simplified assumptions regarding probe-target duplex formation from linear sequence analysis. Tests were conducted on microarrays containing 14,283 unique probes synthesized using an in situ light-directed synthesis methodology. The genomic DNA hybridization intensity patterns reproducibly differentiated various organisms (Bacillus subtilis, Yersinia pestis, Streptococcus pneumoniae, Bacillus anthracis, and Homo sapiens), including the correct identification of a blinded “unknown” sample. Applications of this method include not only pathological and forensic genome identification in medicine and basic science, but also potentially a novel method for the discovery of unknown targets and associations inherent in dynamic nucleic acid populations, such as those represented by differential gene expression.

1Vitruvius Biosciences, The Woodlands, 2UT Southwestern Medical Center, Dallas, TX, and 3University of New Mexico, Albuquerque, NM, USA

BioTechniques, Vol. 37, No. 4 (2004)

INTRODUCTION

There is a growing need to develop genomic technologies, especially microbial sensors, that allow early and accurate detection of potentially hazardous microorganisms. Sophisticated approaches using highly parallel biochip-type technologies are maturing and enable the design and execution of research projects not possible in the past (1–5). However, such approaches are difficult to implement when the analyte space is large and complex, and nearly impossible when sequence information is unknown. In addition, it remains difficult to select accurate and informative probes using linear computational scanning of genomic sequence and first-principle assumptions of nucleic acid duplex formation (melting temperature, ionic strength considerations, mismatch tolerance, etc.) (6). Herein we describe an alternative to sequence-dependent methods that uses a systematic combinatorial approach to probe design and has the potential to serve as a universal array-based sensor.

The significance of this project lies in the design of a universal sensor that, although based upon nucleic acid hybridization principles, does not require prior analysis of genomic sequence hybridization predictions to derive an informative array probe set. Sequence-specific probe design has historically been compromised in predicting true hybridization outcomes in any case, because a variety of hidden variables, such as the secondary structure of analyte molecules, cannot be accounted for (6,7). Here we describe the derivation, synthesis, and validation of a microarray probe set and its subsequent utility as a universal array for the differentiation of microbial genomes. The application of this approach to the analysis of other, more complex nucleic acid populations is also discussed.

MATERIALS AND METHODS

Probe Set Design

The probe set contained a combination of 9015 12-mers and 5268 13-mers. Each individual probe in the probe set contained either 7 of 12 (58.3% GC) or 7 of 13 (53.8% GC) G or C bases. An algorithm was used to derive complete sets of 9015 12-mer probes in which each probe differs from every other by at least 4 bases, and 5268 13-mer probes that differ by at least 5 bases. Briefly, the probe set was derived by generating sequences at random. Each sequence so generated was included in or excluded from the probe set based on its degree of concordance, as described above, with the already accepted members of the set. A complete list of the probe set is available online (http://innovation.swmed.edu; see Supplementary Table 1).

Fabrication of Arrays

Prototype microarrays were fabricated by an in situ synthesis technology, digital optical chemistry (DOC), as described previously (8–11). Each array contained 14,283 unique probes, 12 or 13 nucleotides in length, in duplicate (see Supplementary Figure 1 at http://innovation.swmed.edu). The typical stepwise synthesis yield is 96.3%; therefore, about 61% of the 13-mer probes are fabricated to full length.

Sample Preparation and Hybridization

Bacterial genomic DNAs (gDNAs) were obtained for initial testing. We selected Streptococcus pneumoniae (BAA-334D; ATCC, Manassas, VA, USA), Bacillus subtilis (6633; ATCC), Bacillus anthracis (Ames strain), and Yersinia pestis. Y. pestis gDNA was isolated from a cultured clinical isolate. The gDNAs of Y. pestis and B. anthracis were provided by C. Rick Lyons (University of New Mexico). Homo sapiens sample DNA was obtained commercially (1691112; Roche Applied Science, Indianapolis, IN, USA).

To test reproducibility and independence from the labeling method, two techniques were used to label the gDNA. In the first technique, samples were first digested with restriction endonuclease MspI (CCGG) and subsequently labeled using the Alexa Fluor® 546 fluorophore with the ULYSIS® Nucleic Acid Labeling Kit (Molecular Probes, Eugene, OR, USA). A second group of samples was labeled using nick translation (Promega, Madison, WI, USA) to incorporate Cy™3- or Cy5-labeled oligonucleotides (Amersham Biosciences, Little Chalfont, Buckinghamshire, UK). Following the labeling reactions, excess nucleotides were removed either by Sephadex® column filtration or by ethanol precipitation. For those samples purified by column filtration, a Microcon® YM-30 filter device (Millipore, Bedford, MA, USA) was used for concentration. Ultimately, all samples were resuspended in 50 μL of hybridization buffer [5× standard saline citrate (SSC), 5× Denhardt’s solution].

Samples were then heat-denatured for 5 min at 95°C and snap-cooled in an ice bath before application to the arrays. Approximately 1.6 μg of labeled DNA in a 25-μL volume of hybridization buffer was applied to the array. Hybridization was performed overnight (16 h) at 23°C under a coverslip. After hybridization, arrays were washed three times with wash buffer (2× SSC), spun dry, and immediately imaged.
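As a quick check of the full-length fraction quoted under Fabrication of Arrays, the sketch below assumes that each coupling step succeeds independently with the stated stepwise yield, so that the full-length fraction of an n-mer is the yield raised to the power n. The function name is illustrative and is not part of the DOC software.

```python
def full_length_fraction(stepwise_yield: float, length: int) -> float:
    """Fraction of probes reaching full length, assuming `length` independent
    coupling steps that each succeed with probability `stepwise_yield`."""
    if not 0.0 <= stepwise_yield <= 1.0:
        raise ValueError("stepwise_yield must be between 0 and 1")
    if length < 0:
        raise ValueError("length must be non-negative")
    return stepwise_yield ** length


if __name__ == "__main__":
    # 96.3% is the typical stepwise yield stated in the text.
    for n in (12, 13):
        frac = full_length_fraction(0.963, n)
        print(f"{n}-mer full-length fraction: {frac:.3f}")
```

Under this assumption the 13-mer value comes out near 0.61, in line with the 61% stated in the text, and the 12-mer value is slightly higher.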

Imaging

The fluorescence intensities from arrays hybridized with labeled gDNAs were detected, and images were recorded using an Axon Instruments GenePix® 4000B two-color array scanner (Molecular Devices, Sunnyvale, CA, USA) (see Supplementary Figures 2–10 at http://innovation.swmed.edu). Replicates labeled by the two methods showed very little difference (data not shown). For repeats of the same organism with the same labeling method, the correlation coefficient for all data varied between 0.7 and 0.9. Replicates of the same organism that used different labeling methods, or arrays that were stripped and reused after previous hybridizations, gave an overall correlation coefficient as low as 0.5; however, as seen in the data, this did not affect the robustness of grouping replicates via cluster analysis.

The probe intensities were extracted by custom code, DOCUMENTOR, which superimposes a grid on the image of the spatially uniform array (probe set) produced by the DOC instrument. Any individual pixel intensities more than 2 standard deviations from the average for a feature are rejected, because they typically represent fluorescent contamination. The raw values and statistics for each feature are then exported by the software.
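The pixel-rejection rule described for feature extraction can be sketched as follows. This is not the DOCUMENTOR code, which is not shown in this report; it is a minimal illustration that assumes a single rejection pass using the sample standard deviation, and the function name is hypothetical.

```python
import statistics


def feature_statistics(pixels, n_sd=2.0):
    """Summarize one array feature after rejecting outlier pixels.

    Pixels farther than `n_sd` standard deviations from the feature mean are
    dropped (assumed here to be a single pass, not an iterative trim).
    Returns (mean, standard deviation, number of pixels kept).
    """
    if len(pixels) < 2:
        raise ValueError("need at least two pixels per feature")
    mean = statistics.fmean(pixels)
    sd = statistics.stdev(pixels)
    kept = [p for p in pixels if abs(p - mean) <= n_sd * sd]
    kept_sd = statistics.stdev(kept) if len(kept) > 1 else 0.0
    return statistics.fmean(kept), kept_sd, len(kept)


if __name__ == "__main__":
    # Nine ordinary pixels and one bright speck (made-up values).
    pixels = [100.0] * 9 + [1000.0]
    print(feature_statistics(pixels))
```

In the made-up example, the single bright pixel lies more than 2 standard deviations from the feature mean and is excluded, so the reported mean reflects only the nine remaining pixels.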

Table 1. Number of Probes with 2-Fold Differences in the Hybridization Intensities Between Any Pair of Organisms

                            B. anthracis  B. subtilis  Y. pestis  S. pneumoniae  H. sapiens
Bacillus anthracis                    59
Bacillus subtilis                    150            8
Yersinia pestis                      195          139        494
Streptococcus pneumoniae             939          947        822             43
Homo sapiens                         850          297        369            891        N.A.

The numbers of probes with greater than 2-fold differences in the hybridization intensities between any pair of organisms out of a total of 14,283 probes on the array. Note that these differences are the minimum necessary to exclude one sample from the rest, given the reproducibility of the repeats. N.A., not applicable.
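A count of the kind reported in Table 1 could be obtained along the following lines. This is a simplified sketch and not the GeneSpring procedure used in the study: it assumes each sample is divided by its own median intensity, and it applies the 0.01 floor after normalization to keep ratios finite. The function names are illustrative.

```python
import statistics


def median_normalize(intensities):
    """Divide every probe intensity in one sample by that sample's median."""
    med = statistics.median(intensities)
    if med <= 0:
        raise ValueError("median intensity must be positive")
    return [x / med for x in intensities]


def count_fold_differences(sample_a, sample_b, fold=2.0, floor=0.01):
    """Count probes whose normalized intensities differ by more than `fold`.

    `floor` stands in for the arbitrary value (0.01) assigned to
    below-background measurements, and keeps the ratio finite.
    """
    if len(sample_a) != len(sample_b):
        raise ValueError("samples must cover the same probes")
    a = median_normalize(sample_a)
    b = median_normalize(sample_b)
    count = 0
    for x, y in zip(a, b):
        x, y = max(x, floor), max(y, floor)
        ratio = x / y if x >= y else y / x
        if ratio > fold:
            count += 1
    return count


if __name__ == "__main__":
    # Made-up intensities for five probes in two samples.
    a = [100, 200, 300, 400, 500]
    b = [100, 200, 900, 400, 100]
    print(count_fold_differences(a, b))
```

Comparing a sample with itself gives zero by construction, which mirrors the idea that the diagonal of Table 1 measures replicate-to-replicate noise rather than organism differences.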

Data Analysis

The results were imported into GeneSpring® (Silicon Genetics, Redwood City, CA, USA) for analysis. The internal replicates were averaged, and data with values less than 2 standard deviations above no-hybridization controls were set to an arbitrary value, 0.01. The data were normalized by standard means: the measurements were first normalized to the median of all the measurements in the sample, and each feature was then normalized to the median of its measurements in all the samples. The Global Error Model (as described in the application note, http://www.silicongenetics.com/Support/GeneSpring/GSnotes/gem.pdf) was used to calculate a noise threshold. Probes that did not meet the threshold were excluded from further analysis. The replicate microarrays were then compared, and probes that showed poor reproducibility between replicates were filtered and excluded from further analysis. The filtered data were clustered first by feature and then by sample using the standard correlation method.

RESULTS AND DISCUSSION

For the fabrication of the initial prototype universal arrays, an oligonucleotide probe length of approximately 12 to 13 bases was deemed to be optimal in differentiating microbial genomes of sizes from approximately 0.5 to 6.0 megabases based upon published work (12–14) and our own unpublished observations. A complete set of probes with all sequences represented for any particular oligonucleotide of length n contains 4^n probes, which, for a 12-mer, would be approximately 16.8 million probes. Because the synthesis of such a large set would be neither practical nor necessary, our criteria for constructing an information content-optimized subset were based on the following assumptions and parameters:

1. The probe set is defined by GC base content. In the current design, each individual probe in the probe set contained either 7 of 12 (58.3% GC) or 7 of 13 (53.8% GC) G or C bases. Similar binding constants for all probes at any particular ionic strength or temperature are thus achieved.

2. The subset of probes is designed such that all probes are maximally nonhomologous to each other. For a finite probe set size, it is important to ensure that related sequences are minimized.
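The two design criteria (fixed GC content and a minimum number of differing bases between any two probes) together with the random generate-and-filter procedure from Materials and Methods can be sketched as below. This is an illustration under stated assumptions, not the authors' algorithm: "differs by at least 4 bases" is interpreted here as Hamming distance between aligned probes, reverse complements and shifted alignments are not considered, and all function names and parameters are hypothetical.

```python
import random

GC = "GC"
AT = "AT"


def random_probe(length, gc_count, rng):
    """Return a random sequence of `length` bases with exactly `gc_count` G/C."""
    bases = [rng.choice(GC) for _ in range(gc_count)]
    bases += [rng.choice(AT) for _ in range(length - gc_count)]
    rng.shuffle(bases)
    return "".join(bases)


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def generate_probe_set(length, gc_count, min_distance, target,
                       seed=0, max_attempts=100_000):
    """Greedy random construction: keep a candidate only if it differs from
    every accepted probe in at least `min_distance` positions."""
    rng = random.Random(seed)
    accepted = []
    for _ in range(max_attempts):
        if len(accepted) >= target:
            break
        candidate = random_probe(length, gc_count, rng)
        if all(hamming(candidate, p) >= min_distance for p in accepted):
            accepted.append(candidate)
    return accepted


if __name__ == "__main__":
    probes = generate_probe_set(length=12, gc_count=7, min_distance=4, target=20)
    print(len(probes), "probes, e.g.", probes[:3])
    print("all 12-mers:", 4 ** 12)  # size of the complete 12-mer set
```

Each candidate is compared against every accepted probe, so the cost grows quadratically with the set size; a set as large as the one described here would likely need a more efficient search than this sketch provides.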


Figure 1. Hybridization of labeled genomic DNAs (gDNAs) to the universal array. (A) Zoomed images of a portion of the entire universal array containing 20 × 20 probes. Each probe is 68 × 102 μm. The organisms hybridized to each of the arrays are: upper left, Bacillus anthracis; upper right, Unknown; lower left, Streptococcus pneumoniae; lower right, Homo sapiens. Note the similar pattern in the upper left and right, which correctly identifies the unknown as B. anthracis. (B) Cluster analysis shows different probe intensity signatures and demonstrates that replicate genomes are clustered together.

Figure 2. The scatterplot shows the intensities of the unknown sample against the two closest matches. The dark gray diamonds represent the full set of probes that matched within 2-fold between Bacillus anthracis and the unknown sample. The probes that show at least a 2-fold change between the data sets are shown as black triangles. The gray squares illustrate larger mismatches between the unknown sample and the next closest match, Bacillus subtilis. The lines emanating from the origin approximate the 2-fold differences.

[Figure 3 plots. (A) y-axis: No. of Occurrences of Each Probe with 2 Mismatches; x-axis: Probe Intensities. (B) y-axis: Hybridization Factor (e^(-ΔG/RT)); x-axis: Probe Intensities.]

Figure 3. Experimental and computed values for the Streptococcus pneumoniae genome. (A) The experimental intensity of the probe set for the S. pneumoniae genome does not correlate with the number of occurrences of a probe in the genome. Only the data for probes with up to two mismatches are shown here, but the counts for perfect matches up to three mismatches behave the same way. (B) Similarly, the free energy hybridization factor does not correlate with the experimental intensity. The hybridization factor = e^(-ΔG/RT), where ΔG is the free energy per probe in kcal/mol, R is the universal molar gas constant, and T is the hybridization temperature. T = 295 K.
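The hybridization factor defined in the Figure 3 caption can be evaluated directly. The sketch below assumes R = 1.987 × 10^-3 kcal/(mol·K), a value not stated in the report, and uses the free energy quoted in the text for probe YC-13747 as its example input; the function name is illustrative.

```python
import math

R_KCAL_PER_MOL_K = 1.987e-3  # universal gas constant in kcal/(mol*K)


def hybridization_factor(delta_g_kcal_per_mol, temperature_k=295.0):
    """Return exp(-dG / (R*T)) for a free energy given in kcal/mol."""
    if temperature_k <= 0:
        raise ValueError("temperature must be positive (kelvin)")
    return math.exp(-delta_g_kcal_per_mol / (R_KCAL_PER_MOL_K * temperature_k))


if __name__ == "__main__":
    # Free energy reported in the text for probe YC-13747.
    print(f"{hybridization_factor(-19.2):.2e}")
```

With these inputs the factor is on the order of 10^14, consistent in magnitude with the value quoted in the text for this probe; small differences are to be expected from rounding of ΔG and the choice of R.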


Bacterial gDNAs were obtained for initial testing to evaluate the utility of the universal array probe set with regard to reproducibility and the ability to discriminate between organisms. We selected S. pneumoniae, B. subtilis, B. anthracis (Ames strain), and Y. pestis for initial testing. The genomes of these organisms, as well as human gDNA, were labeled either by direct chemical labeling incorporating the Alexa Fluor 546 fluorophore (15) or by a nick translation protocol incorporating Cy3 or Cy5 fluorophores (see Materials and Methods). In all cases, array hybridization patterns were found to be independent of the labeling method (data not shown). Hybridization of labeled gDNAs to the universal array demonstrated that patterns were reproducible among replicate samples and varied significantly among organisms, as shown in Figure 1. Of 8366 probes that showed reproducible genomic intensity signatures, 4255 showed at least a 2-fold change in at least one sample. The cluster diagram (Figure 1B) confirms the strong concordance among replicate arrays and shows that the probe intensity signatures are unique to the tested organisms.

From Table 1 it can be seen that a robust analysis with strong statistical power is enhanced by the judicious selection of a sufficiently large probe set. Although a small number of individual markers may not reproduce well, the differential signatures generated from the remainder of the probes on the array still have sufficient statistical power to assign an unknown to its most likely signature.

The real power of genomic intensity signatures is to identify unknown organisms in the absence of other predictive data. As an initial test of this hypothesis, hybridization of a blinded unknown microbial gDNA sample was performed, and the array intensity matrix was compared to biosignatures generated from previous experiments (Figure 1A). As can be seen in the cluster diagram (Figure 1B), the unknown sample pattern strongly clustered with the B. anthracis genomic signature. This correlation proved to be correct and is clearly evident in the scatterplot of B. anthracis versus the unknown sample, in which only 59 probes showed a significant (2-fold or greater) difference between the two samples (Figure 2).
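The idea of assigning an unknown to its most similar stored signature can be sketched with a plain Pearson correlation. This is a simplification: the study used correlation-based clustering in GeneSpring, whereas the sketch only picks the single best-correlated reference. The signature values below are made up for illustration and are not measured data.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length intensity vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("vectors must have equal length of at least 2")
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return sxy / math.sqrt(sxx * syy)


def best_match(unknown, signatures):
    """Return (name, correlation) of the reference signature closest to `unknown`."""
    scores = {name: pearson(unknown, sig) for name, sig in signatures.items()}
    name = max(scores, key=scores.get)
    return name, scores[name]


if __name__ == "__main__":
    # Made-up five-probe signatures, for illustration only (not measured data).
    signatures = {
        "organism_A": [10.0, 50.0, 20.0, 80.0, 5.0],
        "organism_B": [60.0, 15.0, 70.0, 10.0, 40.0],
    }
    unknown = [12.0, 45.0, 25.0, 75.0, 8.0]
    print(best_match(unknown, signatures))
```

A practical classifier would also need a rejection threshold, so that an organism absent from the archive is reported as unmatched rather than forced onto its nearest neighbor.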

Table 1 shows the number of probes with 2-fold differences in the hybridization intensities between any pair of organisms. The significant differences between replicates are shown on the diagonal. Duplicate H. sapiens arrays were not done. It should be noted that even with the high number of differences seen in the replicate Y. pestis microarrays, the repeats still clustered together. For the unknown, the next best match, B. subtilis, had 150 probes that are expected to show a significant difference between it and B. anthracis (Table 1). In practice, the unknown sample had 138 probes that showed a significant difference and only 12 probes that showed similarity with B. subtilis. With an error rate of less than 8%, the likelihood of a misclassification based on these data is very low. It is this differential that provides a great deal of the resolution and predictive power of this type of universal array method. Even small natural or intentional modifications of an organism’s genetic material are likely to alter genomic intensity signatures enough to be detected based upon this number of probes and the rational design of the set. The precise level of differentiation will be a function of several factors, including variance in the assay, the exact amount of the genomic difference, the hybridization conditions, and of course the complexity of the array.

Although array signature patterns may unequivocally characterize and identify individual microbial genomes, individual probe intensity values may or may not correlate with simple hybridization models based upon the frequency of appearance of individual probe sequences in a sequenced genome. To examine this, intensity predictions were modeled using two components. For each probe on the array, the number of occurrences of perfect matches (and of matches with 1 to 3 mismatches) was computed for both strands. In addition, a free energy computation for each probe was performed (as in Reference 9).
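The first modeling component, counting perfect and mismatched occurrences of a probe on both strands, can be sketched as a brute-force scan. This is an assumption-laden illustration rather than the code used in the study: it considers only ungapped alignments, treats the genome as a linear string of A, C, G, and T, and searches the second strand by scanning for the probe's reverse complement. Function names are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of a DNA sequence written with A, C, G, T."""
    return seq.translate(_COMPLEMENT)[::-1]


def count_occurrences(genome, probe, max_mismatches=3):
    """Count alignments of `probe` to both strands of `genome`.

    Returns a dict mapping mismatch count (0..max_mismatches) to the number
    of ungapped windows with exactly that many mismatches.
    """
    counts = {k: 0 for k in range(max_mismatches + 1)}
    n = len(probe)
    for target in (probe, reverse_complement(probe)):
        for start in range(len(genome) - n + 1):
            window = genome[start:start + n]
            mismatches = sum(a != b for a, b in zip(window, target))
            if mismatches <= max_mismatches:
                counts[mismatches] += 1
    return counts


if __name__ == "__main__":
    # Toy sequence: "AAC" occurs once directly and once as its reverse complement "GTT".
    print(count_occurrences("AACGTT", "AAC", max_mismatches=1))
```

A probe that equals its own reverse complement is counted twice at each site by this sketch, and a scan of this kind is slow on megabase genomes, where an indexed search would be the more usual choice.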
Neither component nor their combination (data not shown) accurately predicts the measured intensity values, thus underscoring the difficulty of predicting the performance of an array from computational considerations alone, as mentioned in the Introduction. For example, the experimental and computed values for the S. pneumoniae genome are shown in Figure 3. In an extreme case, our probe number YC-13747, with the sequence CCGACTTCGTGA, had 24 perfect matches to the S. pneumoniae genome, the most of any probe, yet had an intensity of only 2500 and a free energy of -19.2 kcal/mol (hybridization factor of 1.8 × 10^14 for T = 295 K).

An implementation of this strategy may allow for the development of a reliable universal biosensor based upon a specific sequence-unbiased oligonucleotide probe set array that can detect and differentiate microbial genomes in the absence of specific sequence data (16). The construction of an archive of signatures including diagnostic fingerprints of mixtures, strains, and biovars is in progress. Although not addressed in this report, the sensitivity of this array-based method is commensurate with that of standard microarrays [e.g., GeneChip® arrays from Affymetrix (Santa Clara, CA, USA)] (9), especially when amplification methods are used; such methods were avoided in this demonstration to escape any potential bias that they may introduce. It is envisioned that as more relevant clinical samples are processed, more methods for the random/semirandom amplification of nucleic acid targets will be employed (17).

Applications of such devices may extend into many areas, including military and civilian biodefense, food and environmental microbial monitoring, and basic microbiology research. The approach is also applicable to unsequenced genomes or strains of microbes that may have drifted or been intentionally engineered. However, the potential utility of this approach may extend well beyond microbial analysis. These initial studies suggest that a similar approach may in fact be an excellent forensics tool for the systematic study of more complex nucleic acid targets, such as those represented by human, animal, and plant genomes and transcriptomes.
Given the complexity of these target populations and the inability of current technologies to assess them in a nonbiased systematic fashion, technologies may be developed using methods similar to those described in this letter to uncover associations not accessible by any other means. A very significant aspect of this sequence-unbiased approach may reside in the potential for designing exploratory-type arrays that can be used for uncovering unknown patterns and individual messenger RNA (mRNA) species associated with differential cellular gene expression. Interrogation of the potential target space represented by expressed gene pools is very difficult from a priori probe design computations built upon either genomic sequence or cDNA sequences because of the dynamic nature of the transcriptome and the potential for many mRNA isoforms from one gene as a result of alternative splicing, polyadenylation, RNA editing, and as yet unknown mechanisms. As an example, it is believed that 60% of all human transcripts (mRNAs) are alternatively spliced (18). Within this 60%, on average, each transcript has five isoforms or variants. In one case, a human gene (DSCAM, a homolog of a Drosophila spinal motor gene) has been found to contain 18 to 20 exons, with some of these exons having up to 16 variants each (19). When all the combinatorial splicing variants are calculated for this gene alone, 38,000 different splice variants are possible. This is not to say that all genes will have thousands of isoforms, but the potential for variance across the entire transcriptome must be enormous. Thus, it is very difficult, if not impossible, with current technologies or approaches to assess all these variants, as well as other differentially expressed but unknown genes, because doing so requires a priori knowledge or a hypothesis about each of those individual events on which to design diagnostic probes. For these reasons, we are currently investigating the application of the sequence-unbiased approach described here to uncover higher levels of information hidden in the transcriptomic space and subsequently identify new biological associations and potential pharmaceutical targets.

ACKNOWLEDGMENTS

This work was supported by Vitruvius Biosciences, Inc., National Institutes of Health (NIH)/National Cancer Institute (NCI) grant no. CA81656 and by contract DAAD13-02-C-079 from the Soldier Biological Chemical Command, Aberdeen Proving Ground (APG) to the Biological Chemical Countermeasures program of UTSW. This work was also supported by a grant from National Institute of Allergy and Infectious Diseases (NIAID) to H.R.G. through the Western Regional Center of Excellence for Biodefense and Emerging Infectious Disease Research, NIH grant no. U54 AI057156.

COMPETING INTEREST STATEMENT

Y.Y.B. and T.F.P. are the founders of Vitruvius Biosciences, from which the concept described in this report originated and on whose behalf a patent has been filed. The remaining authors declare no conflicts of interest.

REFERENCES

1. The Chipping Forecast II. 2002. Supplement to Nat. Genet. 32:461-551.
2. Southern, E.M. 2001. DNA microarrays. History and overview. Methods Mol. Biol. 170:115.
3. Kato-Maeda, M., Q. Gao, and P.M. Small. 2001. Microarray analysis of pathogens and their interaction with hosts. Cell. Microbiol. 3:713-719.
4. Cummings, C.A. and D.A. Relman. 2000. Using DNA microarrays to study host-microbe interactions. Emerg. Infect. Dis. 6:513-525.
5. Mir, K.U. and E.M. Southern. 2000. Sequence variation in genes and genomic DNA: methods for large-scale analysis. Annu. Rev. Genomics Hum. Genet. 1:329-360.
6. Mir, K.U. and E.M. Southern. 1999. Determining the influence of structure on hybridization using oligonucleotide arrays. Nat. Biotechnol. 17:788-792.
7. Kushon, S.A., J.P. Jordan, J.L. Seifert, H. Nielsen, P.E. Nielsen, and B.A. Armitage. 2001. Effect of secondary structure on the thermodynamics and kinetics of PNA hybridization to DNA hairpins. J. Am. Chem. Soc. 123:10805-10813.
8. Luebke, K.J., R.P. Balog, D. Mittelman, and H.R. Garner. 2002. Digital optical chemistry: a novel system for the rapid fabrication of custom oligonucleotide arrays. In R. Kordal, A. Usmani, and W.T. Law (Eds.), Microfabricated Sensors, Application of Optical Technology for DNA Analysis. American Chemical Society Publications.
9. Luebke, K.J., R.P. Balog, and H.R. Garner. 2003. Prioritized selection of oligonucleotide probes for efficient hybridization to RNA transcripts. Nucleic Acids Res. 31:750-758.
10. Balog, R.P., Y. Emi Ponce de Souza, H.M. Tang, G.M. DeMasellis, B. Gao, A. Avila, D.J. Gaban, D. Mittelman, et al. 2002. Parallel assessment of CpG methylation by two-color hybridization with oligonucleotide arrays. Anal. Biochem. 309:301-310.
11. Garner, H.R. Digital optical chemistry micromirror imager. Patent Application No. 20020041420, Filed: November 29, 2001, Published: April 11, 2002.
12. Elhai, J. 2001. Determination of bias in the relative abundance of oligonucleotides in DNA sequences. J. Comp. Biol. 8:151-175.
13. Hsieh, L.C. and H.C. Lee. 2002. Model for the growth of bacterial genomes. Mod. Phys. Lett. B 16:821-827.
14. Van Dam, R.M. and S.R. Quake. 2002. Gene expression analysis with universal n-mer arrays. Genome Res. 12:145-152.
15. Haugland, R.P. 2002. Section 8.5. In R.P. Haugland (Ed.), Handbook of Fluorescent Probes and Research Chemicals, 9th ed. Molecular Probes, Eugene, OR.
16. Belosludtsev, Y. and T.F. Powdrill. Nucleic acid hybridization-based biosensing devices and methods utilizing intelligently-designed oligonucleotide probe sets. Patent Application No. 10/327,782, Filed: December 23, 2002.
17. Wang, D., L. Coscoy, M. Zylberberg, P.C. Avila, H.A. Boushey, D. Ganem, and J.L. DeRisi. 2002. Microarray-based detection and genotyping of viral pathogens. Proc. Natl. Acad. Sci. USA 99:15687-15692.
18. Wang, H., E. Hubbell, J.-S. Hu, G. Mei, M. Cline, G. Lu, T. Clark, M.A. Siani-Rose, et al. 2003. Gene structure-based splice variant deconvolution using a microarray platform. Bioinformatics 19(Suppl 1):I315-I322.
19. Celotto, A.M. and B.R. Graveley. 2001. Alternative splicing of the Drosophila dsCAM pre-mRNA is both temporally and spatially regulated. Genetics 159:599-608.

Received 6 February 2004; accepted 16 June 2004.

Address correspondence to: Harold R. Garner, UT Southwestern Medical Center, 5323 Harry Hines Blvd., Dallas, TX 75390-8591, USA. e-mail: [email protected]
