APPLIED AND ENVIRONMENTAL MICROBIOLOGY, Nov. 2005, p. 7504–7514 0099-2240/05/$08.00+0 doi:10.1128/AEM.71.11.7504–7514.2005 Copyright © 2005, American Society for Microbiology. All Rights Reserved.

Vol. 71, No. 11

Design, Validation, and Application of a Seven-Strain Staphylococcus aureus PCR Product Microarray for Comparative Genomics†

Adam A. Witney,1 Gemma L. Marsden,1‡ Matthew T. G. Holden,2 Richard A. Stabler,1§ Sarah E. Husain,1 J. Keith Vass,3 Philip D. Butcher,1 Jason Hinds,1 and Jodi A. Lindsay4*

Bacterial Microarray Group1 and Infectious Diseases,4 Department of Cellular and Molecular Medicine, St. George’s Hospital Medical School, London SW17 0RE, The Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SA,2 and Beatson Institute for Cancer Research, Garscube Estate, Bearsden G61 1BD,3 United Kingdom

Received 8 February 2005/Accepted 15 June 2005

Bacterial comparative genomics has been revolutionized by microarrays, but the power of any microarray is dependent on the number and diversity of gene reporters it contains. Staphylococcus aureus is an important human pathogen causing a wide range of invasive and toxin-mediated diseases, and more than 20% of the genome of any isolate consists of variable genes. Seven whole-genome sequences of S. aureus are available, and we exploited this rare opportunity to design, build, and validate a comprehensive, nonredundant PCR product microarray carrying reporters that represent every predicted open reading frame (3,623 probes). Such a comprehensive microarray necessitated a novel design strategy. Validation with the seven sequenced strains showed correct identification of 93.9% of genes present or absent/divergent but was dependent on the method of analysis chosen. Microarray data were highly reproducible, reducing the need for many replicate slides. Interpretation of microarray data was enhanced by focusing on the major areas of variation: the presence or absence of mobile genetic elements (MGEs). We compiled “composite genomes” of every individual MGE and visualized their distribution. This allowed the sensitive discrimination of related isolates, including the first clear description of how isolates of the same clone of epidemic methicillin-resistant S. aureus differ substantially in their carriage of MGEs. These MGEs carry virulence and resistance genes, suggesting differences in pathogenic potential. The novel methods of design and interpretation of data generated from this microarray will enable further studies of S. aureus evolution, epidemiology, and pathogenesis.

Staphylococcus aureus is a gram-positive bacterium that lives as a commensal of the human nose in as much as 80% of the population (20) but is also capable of causing infection when introduced into a wound. In hospitals, antibiotic-resistant strains such as methicillin-resistant S. aureus (MRSA) represent a substantial threat to patients with weakened immune systems and breaches of the skin. S. aureus can infect any tissue of the human body and is a leading cause of wound infection, surgical site infection, bacteremia, endocarditis, pneumonia, osteomyelitis, septic arthritis, and more. Toxins produced by S. aureus can also cause toxic shock syndrome, food poisoning, and scalded skin syndrome. S. aureus is also an important veterinary pathogen, particularly causing mastitis in dairy cows.

Seven S. aureus isolates have been sequenced in their entirety (see Table 1). Comparisons of these genomes show a backbone of ordered conserved genes (the core genome) peppered with a range of genetic elements integrated at dozens of
insertion sites (25). These mobile genetic elements (MGEs) make up approximately 15 to 20% of the genome and include bacteriophages, pathogenicity islands, integrated plasmids, staphylococcal cassette chromosomes (SCC), genomic islands, transposons, and other “islets.” Many of the known or putative virulence and resistance genes carried by S. aureus are found on MGEs. Therefore, a comprehensive microarray that offers extensive coverage of the species, including MGE genes, and can rapidly identify the gene complement carried by any particular strain is a prerequisite for panspecies comparative genomic studies. These studies would enable the investigation of S. aureus variation, evolution, and epidemiology and would identify genes that are associated with pathogenic strains. Furthermore, a multistrain array would be useful for global gene expression studies with a wide range of strains. Already, comparative genomics using microarrays has successfully identified genetic differences between bacterial strains, including S. aureus strains. Indeed, Fitzgerald et al. used a microarray composed of approximately 92% of the S. aureus strain COL genome (2,817 open reading frames [ORFs]) to screen 36 clinical S. aureus isolates and found that 22% of the COL genome was missing in at least 1 of the test isolates (13). S. aureus microarrays with a subset of known virulence genes have also been described (27, 37) and highlight substantial variation between isolates. For other bacteria, including Bordetella spp. (8), Campylobacter jejuni (10, 23), Escherichia coli and Shigella strains (15), Helicobacter pylori (5, 36), Mycobacterium tuberculosis (4, 18), Neisseria gonorrhoeae (39),

* Corresponding author. Mailing address: Infectious Diseases, Department of Cellular and Molecular Medicine, St. George’s Hospital Medical School, London SW17 0RE, United Kingdom. Phone: 44 (0) 208 725 0445. Fax: 44 (0) 208 725 3487. E-mail: [email protected].
† Supplemental material for this article may be found at http://aem.asm.org/.
‡ Present address: Department of Genetics, University of Leicester, University Road, Leicester LE1 7RH, United Kingdom.
§ Present address: Department of Infectious and Tropical Diseases, London School of Hygiene and Tropical Medicine, Keppel Street, London WC1E 7HT, United Kingdom.


TABLE 1. Strains used for template DNA, for validation of the microarray, and for composition of the microarray gene pool

Strain | Description | Sequence source or reference | No. of predicted genes | No. of genes in gene pool
MRSA252 | Epidemic MRSA-16 clone from a United Kingdom hospital | 17 | 2,733 | 2,733 (+10 variants)
N315 | Hospital MRSA from Japan | 22 | 2,595 + 30 plasmid | 149 + 9 (+11 variants)
Mu50 | Hospital vancomycin intermediate-level resistant MRSA from Japan | 22 | 2,710 + 31 plasmid | 103 + 31
COL | Early MRSA | www.tigr.org | 3,630 + 4 plasmid | 387 + 4 (+5 variants)
8325 | Laboratory strain | www.genome.ou.edu/staph.html | 3,777 | 87 (+2 variants)
MW2 | Community MRSA from the United States | 3 | 2,632 + 25 plasmid | 60 + 4 (+4 variants)
MSSA476 (a) | Community MSSA from the United Kingdom | 17 | 2,587 | 21
RN4850 | Template for eta, etb | M17357, M17348 (b) | 2 | 2
SA502A | Template for sed | M28521 (b) | 1 | 1
American 578 | Template for see | M21319 (b) | 1 | 1
NCTC11939 | Template for ccrA type I | 28 | 1 | 1
pID408 | Template for ermC (Tn917) | 26 | 1 | 1
PM11 (c) | EMRSA-16 clone from SGHMS | 28 | |
8325-4 (c) | Lab strain derived from 8325 | 30 | |
Total | | | 20,760 | 3,626

(a) MSSA, methicillin-susceptible S. aureus.
(b) EMBL accession number(s).
(c) Strain not used for microarray design.

Porphyromonas gingivalis (7), Pseudomonas aeruginosa (42), Salmonella spp. (6, 32, 33, 40), Streptococcus pyogenes (38), Vibrio cholerae (12), Xylella fastidiosa (21), and Yersinia pestis (16), microarrays have been used to discover genes that are missing in clinical or laboratory strains. Many of these studies are limited by the composition of the microarray and are often able to make only a one-way comparison to discover the genes deleted or divergent in the test strain compared to the sequenced reference strain. This restricted microarray content, governed itself by sequence availability, means that such studies are unable to detect any genes specific to the test strain that are not present in the reference strain and therefore not represented on the array. However, microarrays that offer extensive coverage of a particular species by inclusion of multiple strain-specific sequences enable a two-way comparison to the reference strain that not only identifies deleted/divergent genes in the test strain but also detects the presence of genes specific to the test strain that are absent in the particular reference strain. Clearly, the wider the scope of the genes on the array, the more informative and comprehensive the comparative genomics studies that can be accomplished. Recently, Dunman et al. (11) developed an Affymetrix GeneChip containing oligonucleotides from six of the S. aureus genome-sequencing projects. The Affymetrix GeneChip represents a complementary yet distinct commercial technology platform based on proprietary design and construction methods (Affymetrix, Santa Clara, Calif.). Affymetrix GeneChips are based on short oligonucleotides of 20 to 25 bases that are synthesized in situ on silicon wafers through a photolithographic manufacturing process, with multiple oligonucleotides and control oligonucleotides for each gene represented. The spotted PCR product approach that we describe presents an

alternative in-house solution, and any apparent advantages of one approach over the other would depend on issues such as availability, access to associated hardware, flexibility in design required, and intended application.

In this paper we describe (i) a new strategy for designing a nonredundant multistrain bacterial PCR product microarray, (ii) a novel approach to predicting the outcome of microarray hybridizations based on genome sequence analysis, (iii) validation of the observed microarray results based on the predicted results, to demonstrate the efficacy of the design and its potential application for identifying the presence or absence of genes in S. aureus isolates at a whole-genome level, and finally (iv) a method for interpreting comparative genomic microarray data that highlights MGEs, which are the most common variable between related S. aureus isolates. Furthermore, as a demonstration of the true value and practical benefit of a multistrain microarray, we show that isolates thought to be clones of epidemic MRSA actually differ substantially in their carriage of MGEs. Such variation has not been documented before and may crucially impact on pathogenesis and treatment options.

MATERIALS AND METHODS

Strains and genome sequences. The strains used in this study are listed in Table 1. The complete genome sequences of seven strains of S. aureus are available; however, annotation was available for only five of these. The remaining two (COL and 8325) were annotated using Artemis software (35). Initial coding sequence predictions were performed using Orpheus and Glimmer2 software (9, 14). The two predictions were amalgamated, and codon usage (29), positional base preference methods, and comparisons to the nonredundant protein databases using BLAST and FASTA software were used to curate the predictions (2, 31). The entire DNA sequence was also compared in all six reading frames against the nonredundant protein databases, using BLASTX to identify any possible coding sequences previously missed. Plasmids from N315, Mu50, and


FIG. 1. Schematic of the design process for a multistrain microarray.

MW2 were included and treated as separate genomes in the design process. In addition, we designed primer pairs to cover variants of some known core genes that are implicated in virulence or horizontal gene transfer and that show significant variation. The genes were agr, ccrA, hsdS, coa, bctE, sdrD, sdrE, fnbpA, fnbpB, and haem (see the supplemental material).

Microarray design. The multistrain microarray design strategy involved choosing one strain to act as a base strain and then adding genes from the other strains that are strain specific or show significant divergence from the genes in the base strain. A sequential process of strain comparisons thus builds up a gene pool that is representative of all seven strains. Selection of strain-specific or divergent genes from the remaining six strains was performed using a cumulative procedure, outlined in Fig. 1, based on sequence comparisons using BLAST (1). An initial gene pool was generated consisting of all predicted genes from S. aureus MRSA252. A BLAST database was created, consisting of the current gene pool plus genes from the first additional strain, N315, which was then searched with the genes from N315. Strain-specific or divergent genes present in N315 were identified by comparing the quality of the BLAST hit for an N315 gene against itself to the best-quality hit against a gene in the gene pool. A relatively poor-quality hit against a gene in the gene pool would indicate an N315-specific gene requiring addition to the gene pool for representation on the array. The BLAST bit score was used as the quality measure, because it reflects the length as well as the degree of sequence similarity. The comparison of BLAST hit quality was achieved by calculating the ratio of BLAST bit scores, whereby the bit score for a gene against itself is divided by the bit score of the best hit in the gene pool. N315 genes with a bit score ratio greater than 2 were highlighted as those genes to be added to the gene pool, with any borderline cases inspected manually to verify inclusion. Any N315 genes confirmed as strain specific or significantly divergent were added to the gene pool, and then the process was repeated with the next strain. Hence the gene pool grew in size with each successive analysis of a strain due to the addition of any genes not already represented in the gene pool by a gene from a previous strain. Strains were sequentially processed (from first to last, N315, Mu50, COL, 8325, MW2, and MSSA476), followed by plasmids pN315, pMu50, and pMW2. A summary of the composition of the gene pool is given in Table 1.

Ten pairs of gene-specific primers were designed for each sequence in the resulting gene pool using Primer3 (34). The ultimate aim of the design process was to ensure that PCR products on the array represented orthologues in different strains and differentiated paralogues in the same strain, maximizing interstrain cross-hybridization but minimizing intrastrain cross-hybridization. Primers were designed to have a length of 20 to 25 bp and a melting temperature between 50 and 80°C and to produce an amplicon ranging from 100 to 800 bp, with an optimum size of 600 bp. From the potential PCR products designed for each gene in the gene pool, a single PCR product was selected for inclusion on the array. Two main criteria were used for selection of this PCR product based on BLAST bit score comparisons of the PCR products against genes. First, the PCR products were compared to all the genes from each strain to ensure that the PCR product selected matched orthologues in each strain as intended by the design, thus ensuring that each gene in every strain was represented on the array. Second, the PCR products were compared to each gene in the gene pool to make certain that the PCR product selected had minimal similarity to any other paralogues in the gene pool.
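The cumulative gene pool procedure described above can be sketched in Python. This is only an illustration under stated assumptions, not the pipeline used in the study: the names `build_gene_pool` and `toy_bit_score` are hypothetical, genes are represented directly by short toy sequences, and the toy score merely stands in for bit scores that would in practice come from BLAST searches. Manual inspection of borderline cases is not modeled.

```python
from typing import Callable, Dict, List

RATIO_THRESHOLD = 2.0  # self-hit bit score / best gene-pool bit score (from the text)


def toy_bit_score(query: str, subject: str) -> float:
    """Hypothetical stand-in for a BLAST bit score: 2 points per matching
    position. A real analysis would take bit scores from BLAST output."""
    return 2.0 * sum(a == b for a, b in zip(query, subject))


def build_gene_pool(strains: Dict[str, List[str]], base: str, order: List[str],
                    bit_score: Callable[[str, str], float]) -> List[str]:
    """Start with every gene of the base strain, then process the remaining
    strains in order, adding genes whose best hit in the pool is poor."""
    pool = list(strains[base])
    for strain in order:
        additions = []
        for gene in strains[strain]:
            self_score = bit_score(gene, gene)
            best = max((bit_score(gene, g) for g in pool), default=0.0)
            # No hit at all, or a self/best ratio above the threshold,
            # marks the gene as strain specific or divergent.
            if best == 0.0 or self_score / best > RATIO_THRESHOLD:
                additions.append(gene)
        # Genes are added only after the whole strain has been compared.
        pool.extend(additions)
    return pool


strains = {
    "base": ["AAAAAAAA", "CCCCCCCC"],
    "second": ["AAAAAAAT", "GGGGGGGG"],  # near-identical gene + novel gene
    "third": ["GGGGGGGG", "AAATTTTT"],   # already pooled gene + divergent gene
}
pool = build_gene_pool(strains, "base", ["second", "third"], toy_bit_score)
```

In this toy run the near-identical gene from the second strain is treated as already represented, whereas the novel and the divergent genes are appended, so the pool grows with each strain in the way the text describes.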

Microarray construction. Template DNA from each of the sequenced strains was prepared using a cesium chloride method as previously described (24) or Qiagen midi-preps with lysostaphin treatment (27). PCR primers were synthesized by MWG Biotech (Ebersberg, Germany) and supplied in 96-well format to enable high-throughput amplification of PCR products using a liquid handling and PCR amplification robot (RoboAmp 9600; MWG Biotech). Initially, standard 100-µl PCRs were performed with 10 ng DNA template, 5 U HotStar Taq DNA polymerase (Qiagen), 0.5 µM primers, 1.5 mM MgCl2, and 200 µM deoxynucleoside triphosphates. Thermocycling was performed using an initial denaturation of 95°C for 15 min and 40 cycles of 95°C for 1 min, 52°C for 1 min, and 72°C for 1 min, followed by a final extension of 72°C for 5 min. All PCR products were verified by agarose gel electrophoresis to confirm amplification of a single product of the expected size. Modification of the standard PCR conditions was undertaken for any failures or multiple products by either adjusting the annealing temperatures or adding the PCR additive Solution Q (Qiagen) until single products of the correct expected size were obtained for all reactions. Five percent of the PCR products were sequenced to provide additional validation of the products, with aliquots of all PCR products stored to enable additional confirmatory sequencing of array elements in the future if required. Aliquots (50 µl) of the PCR products were concentrated by isopropanol precipitation, resuspended in 15 µl 50% dimethyl sulfoxide in H2O, and printed in duplicate onto GAPS slides (Corning) using a MicroGrid II arraying robot (BioRobotics, United Kingdom). The printed microarrays were post-print processed according to the slide manufacturer’s instructions (including rehydration, UV irradiation, chemical blocking, and denaturation steps) and then stored at room temperature in the dark prior to use.

Labeling, hybridization, and data acquisition.
Four micrograms each of chromosomal DNA from the test strain and the reference strain (MRSA252) was labeled with Cy3 and Cy5, respectively, using the method described previously (10). Microarray slides were prehybridized in 3.5× SSC (1× SSC is 0.15 M NaCl plus 0.015 M sodium citrate)–0.1% sodium dodecyl sulfate (SDS)–10 mg/ml bovine serum albumin at 65°C for 20 min before a 1-min wash in distilled water and a subsequent 1-min wash in isopropanol. Each Cy3-labeled test strain DNA was mixed with an equal amount of Cy5-labeled reference strain DNA, purified using a MinElute kit (Qiagen), denatured, and mixed to achieve a final 45-µl hybridization solution of 4× SSC–0.3% SDS. Using two 22- by 22-mm LifterSlips (Erie Scientific), the microarray was sealed in a humidified hybridization chamber (Telechem International) and hybridized overnight by immersion in a water bath at 65°C for 16 to 20 h. Slides were washed once in 400 ml 1× SSC–0.06% SDS at 65°C for 2 min and twice in 400 ml 0.06× SSC for 2 min. Slides were scanned using an Affymetrix 428 scanner, and signal data were extracted using ImaGene 5.2 (BioDiscovery). For each strain tested, two to four microarray slides were hybridized and analyzed.

Data analysis. The ImaGene output for each array was loaded into GeneSpring 6.2 (Silicon Genetics) for normalization and analysis. For each spot, the median pixel intensity for the local background was subtracted from the median pixel intensity of the spot, and any values less than 1.0 were adjusted to 1.0. Background-subtracted pixel intensities for the test strain channel were divided by those for the reference strain channel. The resulting log ratios were normalized by applying the LOWESS intensity-dependent normalization, using 40% of the data to calculate the LOWESS fit for each point. If the value for the reference channel was less than 10, then a value of 10 was used instead. This normalization scheme counteracted any dye effects at low intensity and enabled interarray comparisons by equalizing the median ratio on each array. To account for any skew in the global normalization caused by an imbalance in the outlier populations, due to a disproportionate number of genes either deleted in or specific to a particular strain, an additional transformation was applied to the data. This additional transformation also facilitated some of the downstream analysis methods. A subset of core genes, defined as present in all strains analyzed, was determined by identifying those genes with a ratio between 0.5 and 2 on every single array. The ratio value for each gene was then divided by the median ratio of the core genes, ensuring that on all arrays the data were centered around the core genes which had a median ratio of 1. When data were combined for replicate arrays, the mean of the ratios was calculated with any genes flagged as empty by ImaGene excluded. Empty flags in ImaGene were assigned to spots with a signal intensity less than twice the local background intensity.

In order to validate the array design process, these normalized ratio data were analyzed using three commonly applied methods to determine the presence or absence/divergence in the test or reference strain. All these approaches are essentially trying to classify genes into four groups: group 1, genes present in the test strain and absent/divergent in the reference strain; group 2, genes present in the reference strain and absent/divergent in the test strain; group 3, genes present in both strains; group 4, genes present in neither strain. In practice, most of the genes falling into the last category were flagged by ImaGene as absent because the fluorescent signal intensity was less than 2 times the background signal. These genes were not included in subsequent analysis.

(i) Twofold cutoff.
An arbitrary cutoff of twofold was used to identify those genes that are specific to one of the strains. Therefore, for all strains, the upper cutoff was set at a ratio of 2 and the lower cutoff at a ratio of 0.5. Genes with a ratio greater than the upper cutoff were deemed to be specific to the test strain, genes with a ratio less than the lower cutoff were deemed to be specific to the reference strain, and genes with ratios between 0.5 and 2 were deemed to be present in both strains. (ii) 3SD. Rather than using a fixed-value cutoff for all arrays as above, a cutoff based on the variation in the ratio data of the core genes was determined for each strain. For each strain, the standard deviation of ratios for genes within the subset of core genes was calculated to measure variation in the data, and then the ratio cutoffs for each strain were set at 3 standard deviations (3SD) on either side of the median value. The standard deviation was calculated for each test strain independently. For N315, 3SD was calculated to correspond to an upper cutoff of 1.59-fold difference and a lower cutoff of 0.65-fold; for other strains, the corresponding cutoffs were 1.69- and 0.60-fold (Mu50), 1.55- and 0.65-fold (COL), 1.59- and 0.64-fold (8325), 1.53- and 0.66-fold (MW2), and 1.69- and 0-fold (476). Genes with a ratio greater than the upper cutoff were deemed to be specific to the test strain, genes with a ratio less than the lower cutoff were deemed to be specific to the reference strain, and genes with ratios between the cutoffs were deemed to be present in both strains. (iii) GACK. GACK software uses the distribution of the ratio data for each strain to classify genes based on the probability that a gene is either present or absent/divergent (19). The software was set to generate a binary output using a threshold of 50% estimated probability of being present (EPP) in order to split the data into two groups, namely, present or absent genes. 
At 50% EPP there is equal probability that the gene will be called present either correctly or incorrectly. The chosen threshold depends on the relative importance of false positives/negatives for the data set under analysis; a higher threshold will determine more genes to be strain specific, thus increasing the chance of identifying all the truly absent genes but at the cost of miscalling genes that are in fact present. A lower threshold will determine fewer genes to be strain specific, thus increasing the chance of identifying all the truly present genes but at the cost of miscalling genes that are in fact absent. A threshold of 50% was chosen because it should give a balance of false positives/negatives to compare with the other methods tested. However, GACK was originally designed to work for one-way comparisons, that is, to identify only genes that are deleted in the test strain. Using the software in this way would mean that any genes present in the test strain and absent/divergent in the reference strain would be incorrectly classified as present in both strains. To overcome this, the normalized ratio data for each strain were passed through GACK a second time using the inverse of the ratios.

BLAST predictions of microarray results. To enable prediction of the expected microarray results from the sequence data, we first needed to classify genes as “present” or “absent” in each of the sequenced strains based on the sequencing projects. The sequence of every PCR product on the array was compared by BLAST (1) to the complete genome sequence plus any associated plasmid sequence for each of the seven sequenced strains. The complete genome sequence was used in this comparison, in contrast to the annotated gene sequences, to ensure that all factors contributing to the hybridization of genomic DNA were accounted for. This avoids any discrepancies simply due to differences in gene prediction for each strain and also accounts for the contribution of intergenic regions or gene remnants to the final hybridization signal. In order to incorporate the potential for multiple hits of varying qualities in the BLAST comparison, a total bit score (a measure of similarity between two sequences that takes into account both the degree and the length of sequence similarity) (1) was calculated for each PCR product against each genome by summing the bit scores of each high-scoring sequence pair with an E value less than 0.001. Using these total bit scores for each PCR product for each strain, three ratios were calculated: ratio 1, the bit score for each strain against the bit score for the genome from which the PCR product was designed; ratio 2, the bit score for the reference strain against the bit score for the genome from which the product was designed; and ratio 3, the bit score for each strain against the bit score for the reference strain. If either ratio 1 or ratio 2 was greater than or equal to a threshold value of 2 (the threshold value used to determine strain-specific genes during the design process), then ratio 3 was used to determine if the gene was predicted to be present or absent/divergent in the test and/or reference strain; presence was predicted if ratio 3 was greater than the threshold value of 2, and absence was predicted if it was lower than the threshold value. Lists were generated using these criteria, which could be used for comparison with the analyzed microarray data.

Sensitivity, specificity, PPV, and NPV.
To validate the microarrays, we used several different methods to categorize whether genes were “unique to the test strain” (found in the test strain and not in the reference strain) or “unique to the reference strain” (found in the reference strain and not in the test strain). Once we had classified the genes based on microarray data, the results were compared to the BLAST predictions based on sequence analysis. The various lists were compared using the Venn diagram function in GeneSpring, with the likelihood of chance list similarity measured using hypergeometric probabilities. The abilities of the microarray to identify genes “unique to the test strain” and “unique to the reference strain” were calculated separately. Sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV) were calculated as follows. Sensitivity is the proportion of truly unique genes by sequence prediction that were correctly identified by the microarray. It is calculated as the number of genes unique by both microarray and sequence prediction divided by all genes unique by sequence prediction. Specificity is the proportion of genes predicted to be truly not unique by sequence prediction that were correctly called not unique by the microarray. It is calculated as the number of genes not unique by both microarray and sequence prediction divided by all genes not unique by sequence prediction. PPV is the proportion of genes unique by microarray that are truly unique by sequence prediction. It is calculated as the number of genes unique by both microarray and sequence prediction divided by all genes unique by microarray. Note that subtracting the PPV from 100 gives the level of false positives. NPV is the proportion of genes called not unique by microarray that are truly not unique by sequence prediction. It is calculated as the number of genes not unique by both microarray and sequence prediction divided by all genes not unique by microarray. 
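The definitions of sensitivity, specificity, PPV, and NPV given so far can be made concrete with a short Python sketch. It is a minimal illustration, not the GeneSpring workflow used in the study: the function names `call_unique_to_test` and `validation_metrics` and the gene names and ratios are hypothetical, the sequence-based prediction is taken as the truth, only the "unique to the test strain" calls with the twofold upper cutoff are shown, and every denominator is assumed to be nonzero.

```python
from typing import Dict, Set


def call_unique_to_test(ratios: Dict[str, float], upper: float = 2.0) -> Set[str]:
    """Genes whose test/reference ratio exceeds the upper cutoff
    (twofold cutoff by default)."""
    return {gene for gene, ratio in ratios.items() if ratio > upper}


def validation_metrics(array_unique: Set[str], predicted_unique: Set[str],
                       all_genes: Set[str]) -> Dict[str, float]:
    """Sensitivity, specificity, PPV and NPV as percentages, comparing the
    microarray calls with the sequence-based (BLAST) prediction."""
    tp = len(array_unique & predicted_unique)   # unique by both
    fp = len(array_unique - predicted_unique)   # unique by microarray only
    fn = len(predicted_unique - array_unique)   # unique by sequence only
    tn = len(all_genes - array_unique - predicted_unique)  # not unique by both
    return {
        "sensitivity": 100.0 * tp / (tp + fn),
        "specificity": 100.0 * tn / (tn + fp),
        "ppv": 100.0 * tp / (tp + fp),
        "npv": 100.0 * tn / (tn + fn),
    }


# Hypothetical normalized ratios for ten genes.
ratios = {"g1": 3.0, "g2": 2.5, "g3": 1.0, "g4": 0.9, "g5": 1.1,
          "g6": 2.2, "g7": 1.8, "g8": 1.0, "g9": 0.4, "g10": 1.2}
called = call_unique_to_test(ratios)
metrics = validation_metrics(called, {"g1", "g2", "g7"}, set(ratios))
```

The same two functions could be reused for genes "unique to the reference strain" by passing the set of genes below the lower cutoff instead, which mirrors the separate calculation of the two categories described in the text.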
Note that subtracting the NPV from 100 gives the level of false negatives. We also calculated the proportion of genes “correctly” identified by the microarray for each sequenced isolate. This was the number of genes that were called by both microarray and BLAST as either unique to the test strain, unique to the reference strain, or found in both, divided by all the present or marginal genes.

Composite genomes. A major aim of using the microarrays for comparative genomics is to easily identify when an isolate has acquired or lost MGEs. This is because the sequencing projects have suggested that S. aureus isolates belong to a finite number of closely related lineages but that within each lineage there is substantial variation in MGE carriage (25). Due to this variation, clustering of S. aureus isolates based on microarray data for all genes (including MGEs) is inappropriate, and these genes should be identified and removed. Since MGEs often carry resistance and virulence genes, the distribution of these elements in strains is of great interest. Furthermore, many MGEs prevent the host bacterium from acquiring related MGEs via “immunity.” Thus, the identification of an MGE and the genes on it provides information about virulence and resistance gene carriage, mechanisms and possible sources of horizontal transfer, and the susceptibility of the host strain to acquisition of new genes. We constructed composite genomes of each of the sequenced strains and each of the sequenced MGEs (5 plasmids, 13 bacteriophages, 6 S. aureus pathogenicity islands [SaPI], 6 SCC, 7 α genomic islands [GIα], 7 GIβ, 3 conjugative elements, and 3 transposons). This was necessary because of the nonredundant nature of the microarray design, such that only one PCR product was designed if the gene was found in more than one of the sequenced isolates or MGEs. First, we identified every ORF in the sequencing projects that had >90% homology to


FIG. 2. Microarray data for hybridizations comparing the sequenced strains to the reference strain MRSA252. Scatter plots plot the signal intensity for the test (y axis) versus the reference (x axis) strain channels. Data are averages of four replicates, and only spots flagged by ImaGene as present or marginal in all four replicates are shown. Data points are colored according to the BLAST prediction based on the sequence: blue, predicted present in both the test and reference strains; red, predicted present in the test strain only; green, predicted present in the reference strain only. The green lines represent the twofold cutoffs applied to each strain for each of the analysis methods used.

each of the PCR products on the microarray. We then constructed a list (composite genome) of the best-match PCR products for each sequenced gene (in gene order in the genome) for each whole genome or MGE. This typically means that a composite MGE consists of PCR products taken from more than one strain of S. aureus. For example, the SaPI2(Mu50) composite genome consists of PCR products amplified from MRSA252 and N315 but not Mu50. Composite genome lists were imported into GeneSpring. For interpretation of each experiment, the presence or absence of genes within each composite genome was determined by color based on fluorescent signal intensity ratios and visualized using the GeneSpring “Ordered List” function. Isolates related to a sequenced isolate can be compared visually using the relevant composite genome, and deletions can be identified. To identify acquisitions, each of the 50 MGE composite genomes is rapidly searched to identify any novel MGEs or fragments of MGEs.
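The construction of a composite genome described above amounts to picking, for each ORF of a genome or MGE, the best-matching PCR product and keeping the ORF order. The Python sketch below illustrates this under explicit assumptions: the function name `composite_genome`, the ORF and product identifiers, and the identity values are all hypothetical, and "homology" is interpreted here as percent identity from a sequence comparison, which the text does not specify.

```python
from typing import Dict, List, Tuple

# Hypothetical input rows: (orf_id, pcr_product_id, percent_identity),
# e.g. parsed from a tabular comparison of every ORF against every product.
Hit = Tuple[str, str, float]


def composite_genome(ordered_orfs: List[str], hits: List[Hit],
                     min_identity: float = 90.0) -> List[str]:
    """Return the best-matching PCR product for each ORF, in genome order.
    ORFs without a hit above `min_identity` are left out of the list."""
    best: Dict[str, Tuple[str, float]] = {}
    for orf, product, identity in hits:
        if identity <= min_identity:
            continue  # the text requires >90% homology
        if orf not in best or identity > best[orf][1]:
            best[orf] = (product, identity)
    return [best[orf][0] for orf in ordered_orfs if orf in best]


# A toy mobile element with three ORFs in gene order.
element_orfs = ["orf1", "orf2", "orf3"]
hits = [
    ("orf1", "MRSA252_p10", 98.0),
    ("orf1", "N315_p3", 92.0),   # weaker match to the same ORF
    ("orf2", "N315_p7", 95.5),
    ("orf3", "COL_p1", 80.0),    # below the threshold, so not represented
]
composite = composite_genome(element_orfs, hits)
```

As in the SaPI2(Mu50) example from the text, the resulting list can consist of products amplified from strains other than the one carrying the element, because each gene is represented only once on the nonredundant array.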

RESULTS

Design and construction of the microarray. The aim was to build a nonredundant microarray that included one copy of each conserved gene instead of seven copies, one from each of the strains. Therefore, the design strategy involved choosing one “reference” strain and utilizing every ORF from this strain, then systematically identifying the absent and divergent ORFs in the remaining strains and adding them to a “pool of ORFs.” The strategy is illustrated in Fig. 1. S. aureus MRSA252 was selected as the base strain for the design, since it belongs to the clonal group of EMRSA-16 strains that emerged in the United Kingdom in the early 1990s and now cause approximately 50% of all MRSA infections in United Kingdom hospitals (28) and are found worldwide. A number of genes not found in the seven sequenced strains were also included, resulting in 3,626

ORFs (Table 1). Primers for each ORF were designed, and PCR products were amplified from their corresponding template DNAs and printed in duplicate on slides with a range of positive and negative control spots.

Visualization of microarray results. The seven sequenced strains were each cohybridized with the MRSA252 reference strain, and the hybridization intensities for the test and reference strain channels are visualized as scatter plots in Fig. 2 (fully annotated microarray data have been deposited in BμG@Sbase [accession number E-BUGS-30 {http://bugs.sgul.ac.uk/E-BUGS-30}] and also ArrayExpress [accession number E-BUGS-30]). In these scatter plots, the genes have been color coded according to BLAST sequence predictions to indicate whether they are present (red) or absent/divergent (green) in the test strain compared to the reference strain. Clearly, there is a strong association between the genes predicted to be present or absent in the test strain based on sequence comparison and the corresponding signals on the microarray. As expected, the results of comparing the reference strain MRSA252 to itself show a high degree of correlation (R² = 0.9664) (Fig. 2). In several strains, genes colored blue (predicted to be found in both the reference and test isolates by sequence comparison) appear among the genes colored red (predicted to be specific to the test strain). Some of these spots represent plasmid and transposon genes, found in both the test and reference strains, but at an increased copy number in the test isolate. For example, reference strain MRSA252 contains an


FIG. 3. Frequency distribution histogram for the analysis of S. aureus MW2 showing the cutoffs used for the three methods of analysis employed. The histogram represents the number of genes (y axis) at each ratio level (x axis). Data points are colored according to the BLAST prediction based on the sequence: blue, predicted present in both the test and reference strains; red, predicted present in the test strain only; green, predicted present in the reference strain only. The cutoff threshold set by each method of analysis can be observed in relation to the distribution of the data and in particular the “indeterminate” or “marginal” zones located between clear gene presence and absence.

integrated plasmid, while strains N315, MW2, and MSSA476 each carry a free plasmid with some common genes. If we assume that the free plasmid may occur at a copy number greater than 1, this would explain why these genes appear from their hybridization signal ratios to be specific to the test strain (viz., blue spots among the red genes). Similarly, MRSA252 carries two copies of transposon Tn554, while N315 carries five copies, and these spots are colored blue but locate with genes colored red.

Calculating microarray accuracy. To calculate microarray accuracy, it is necessary to decide what constitutes a gene that is “present” or “absent.” It can be seen from the scatter plots in Fig. 2 that a straight line cannot be drawn to separate the blue and red genes or to separate the blue and green genes. This is also illustrated in Fig. 3. The “indeterminate” or “marginal” zones have a mixture of colored spots, indicating a number of genes that by microarray hybridization analysis do not precisely fall into one or the other category of present or absent/divergent. If we want to calculate how accurate the microarray is at calling every gene “present” or “absent,” we need to find the best compromise and draw the “cutoff” lines. Our first step to approach this critical problem was to investigate current methods to categorize genes as present/absent by microarray. When microarrays are used for investigating gene expression, a cutoff of twofold to identify genes up- or down-regulated is often used. With our comparative genomics microarray, a twofold cutoff was used to generate lists of present or absent genes. When these lists were compared to the lists generated by sequence analysis (BLAST prediction), the microarray was calculated to be typically 75% sensitive and >98% specific, as illustrated in Table 2, which shows the data for test strain MW2 (data for all sequenced strains are in Table S1 in the supplemental material). Therefore, the twofold cutoff method is useful but underestimates the number of genes that are truly different in the test strain compared to the reference strain. An advantage of such a conservative method is that false positives and false negatives are relatively low (PPV and NPV are high).

We then used GACK analysis software (19) to classify the genes as present or absent/divergent in the test strain. GACK was originally designed to be used for one-way comparisons only, that is, for use on a microarray carrying genes from only one bacterial strain. To make use of it for identifying genes not found in the reference strain, we repeated the GACK analysis with inverse ratios. GACK binary output (50% EPP) identified more genes as absent/divergent than the twofold cutoff method and substantially increased the sensitivity of the microarray to 97% (see Table S1 in the supplemental material). However, there was a cost in the specificity of the microarray (see Table S1 in the supplemental material): there was a very high false-positive rate, as indicated by the low PPV. The false positives were generally due to spots with low-intensity signals. At low signal intensity, a slight variation can have a proportionately large effect on the ratio, making the data for these genes less reliable.
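For reference, the four accuracy measures quoted here can be computed from the overlap between the microarray calls and the BLAST predictions. The sketch below assumes the standard definitions, with the BLAST prediction treated as the truth; the function name and gene identifiers are illustrative only and are not taken from the study.

```python
def accuracy_metrics(all_genes, array_positive, blast_positive):
    """Compare microarray calls with BLAST predictions (treated as truth).

    all_genes: set of every gene scored on the array
    array_positive: genes called unique to the test strain by microarray
    blast_positive: genes predicted unique to the test strain by BLAST
    Returns percentages; assumes every denominator is nonzero.
    """
    tp = len(array_positive & blast_positive)  # true positives
    fp = len(array_positive - blast_positive)  # false positives
    fn = len(blast_positive - array_positive)  # false negatives
    tn = len(all_genes) - tp - fp - fn         # true negatives
    return {
        "sensitivity": 100.0 * tp / (tp + fn),
        "specificity": 100.0 * tn / (tn + fp),
        "PPV": 100.0 * tp / (tp + fp),
        "NPV": 100.0 * tn / (tn + fn),
    }


# Toy example with 20 made-up genes: BLAST predicts 5 test-specific genes,
# the array calls 5, and 4 of those calls agree with BLAST.
all_genes = {f"g{i}" for i in range(20)}
blast_positive = {"g0", "g1", "g2", "g3", "g4"}
array_positive = {"g0", "g1", "g2", "g3", "g10"}
metrics = accuracy_metrics(all_genes, array_positive, blast_positive)
print(metrics)
```

A low PPV with a high sensitivity, as reported for GACK, corresponds to a large false-positive count relative to the true positives in this calculation.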
The 3SD method we describe here was based on calculating the standard deviation from the median of the ratios of a set of core genes and multiplying by 3 to generate the cutoff points. 3SD had the advantage of increasing the specificity and reducing false positives, but it slightly decreased the sensitivity. In Fig. 2, the 3SD cutoff lines are slightly farther away from the median than the GACK cutoff lines. The 3SD method was the most successful at categorizing genes as present or absent compared to BLAST analysis; by this method, the microarrays were, on average, 91% sensitive and 94% specific, with a PPV of 82% and an NPV of 92%, for the sequenced strains (see Table S1 in the supplemental material). Expressed more simply, the microarray correctly identified >96.5% of genes present or absent/divergent in the sequenced strains compared

TABLE 2. Ability of data analysis methods to predict the presence or absence of genes by microarray compared to sequence prediction using MW2 data^a

                  ORFs unique to the test strain              ORFs unique to the reference strain MRSA252
Method            Sensitivity  Specificity  PPV     NPV       Sensitivity  Specificity  PPV     NPV
Twofold cutoff    75.7         98.5         79.2    98.2      77.0         99.4         93.9    97.3
GACK              97.3         89.9         46.2    99.7      96.1         95.2         74.3    99.4
3SD               92.2         96.6         67.1    99.4      92.5         97.5         81.8    99.1

^a All data are percentages.


to the reference strain MRSA252 (see Table S1 in the supplemental material).

Both GACK and 3SD produced high false-positive rates. To identify possible reasons for this, we inspected the MW2 data using the 3SD interpretation method, which called 95 genes unique to MW2 by microarray when these genes were found in both genomes by BLAST. First, we identified 16 genes that could be called unique to MW2 because of elevated copy number. They are all found in the chromosome in MRSA252 (on an integrated plasmid or on a single integrated transposon), but in MW2 they are found on a free plasmid. Since free plasmids are self-replicating and multiple copies are usually found in any bacterial cell, enhanced copy number will mean that the signal intensities on the microarray will favor MW2. Indeed, virtually all of the 16 genes had intensity ratios greater than 2 and did not fall into the “marginal” zone. Thus, the microarray was not incorrect. An additional explanation for a low PPV is gene redundancy. The S. aureus genome has numerous well-documented examples of multiple genes (often in series) that are closely related, e.g., the superantigen gene nurseries (25). Our design strategy avoided most of these, and we note that no superantigen genes were falsely called by the microarray in this experiment. However, not every case of gene redundancy has been solved by our design strategy. For example, five putative lipoproteins, N315-0092 to -0096, were identified by BLAST as found in both MW2 and MRSA252, although MW2 carries five genes homologous to these products while MRSA252 carries only two genes homologous to these products. When we analyzed the two genomes by our BLAST method, these genes were found in both genomes (blue), although their bit score ratios bordered those found in MW2 only. The intensity ratios for four of these products are greater than 1.54 (and they fall among the red spots), and thus they contribute to the false-positive rate.
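As an aside on the 3SD interpretation method used for this inspection, the cutoff calculation can be sketched as follows. This is a minimal sketch under stated assumptions and not the authors' code: it assumes that ratios are handled as log2(test/reference) values and that the standard deviation about the median uses the divide-by-n form; the gene names and values are made up.

```python
import math
from statistics import median


def three_sd_cutoffs(core_log_ratios):
    """Return (lower, upper) cutoffs as median -/+ 3 SD of core-gene log2 ratios.

    The SD is taken about the median, as the text describes; working on
    log2 ratios and dividing by n are assumptions of this sketch.
    """
    centre = median(core_log_ratios)
    variance = sum((x - centre) ** 2 for x in core_log_ratios) / len(core_log_ratios)
    sd = math.sqrt(variance)
    return centre - 3 * sd, centre + 3 * sd


def classify(log_ratio, lower, upper):
    """Label one gene from its log2(test/reference) ratio."""
    if log_ratio > upper:
        return "test_only"
    if log_ratio < lower:
        return "reference_only"
    return "both"


# Toy core-gene ratios centred on zero, then three hypothetical genes to call.
core = [-0.2, -0.1, 0.0, 0.1, 0.2]
lower, upper = three_sd_cutoffs(core)
ratios = {"geneA": 1.5, "geneB": -2.0, "geneC": 0.3}
calls = {gene: classify(r, lower, upper) for gene, r in ratios.items()}
print(lower, upper, calls)
```

Because the cutoffs are derived from each hybridization's own core-gene spread, they move with the noise of the experiment rather than sitting at a fixed twofold ratio.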
Six genes of the 95 MW2 false positives fall into this category. Finally, 11 false positives were particularly small genes, which generally give lower signal intensities. If the signal is low enough to be in the more variable region of the data set, the ratios generated from these genes become less reliable and harder to call as present/absent. If we remove the enhanced-copy-number, redundant, and small genes from our false-positive list, the PPV is recalculated from 67.1% to 75.5%. In other words, of the 253 genes called specific to MW2 by using 3SD, 62 were not MW2 specific by sequence analysis. The majority of these genes fall into the “marginal” zone with ratios between 1.54 and 2 and are difficult to categorize when a single cutoff line is chosen.

Replicates. The use of replicates is mathematically proven to increase the accuracy of microarrays when used in gene expression studies (41). To prove this in practice for comparative genomic studies, the data generated for MW2 using four microarrays with duplicate spots analyzed separately (as if duplicate spots were two separate arrays), each of the four individual slides (averaging duplicate spots), the averages of two slides (paired according to when the microarrays were performed), and averaged data over all four microarrays (total of eight spots) were compared (full data are shown in Table S2 in the supplemental material). Data were interpreted using the 3SD method and compared to BLAST predictions. The results


show that microarray results are highly reproducible, and even a single slide is typically 85 to 97% sensitive and >93% specific. Surprisingly, replicates increase the sensitivity and specificity of microarrays only marginally. The major advantage of replicates is the substantial decrease in false positives. This is indicated by PPV calculations improving from 54% (with one data set from one slide) to 75% (with eight data sets from four slides).

Composite genomes. Most of the variation that occurs between isolates of S. aureus is due to the horizontal transfer or loss of MGEs. The seven sequencing projects revealed 5 plasmids, 13 bacteriophages, 6 SaPI, 6 SCC, 7 GIα, 7 GIβ, 3 conjugative elements, and 3 transposons. Genes found in one type of MGE are generally not found in other types of MGE. However, within each MGE type, such as SaPI, recombination of SaPI within one strain and subsequent transfer of these rearranged SaPI to other strains at high frequency are probably common (25). Thus, a SaPI typically consists of a mosaic of fragments found in other SaPI. For example, the SaPI2 element from Mu50 consists of 12 genes: the first 3 are also found in SaPI2(N315), the next 4 are found in SaPI4(MRSA252), and the remaining 5 in SaPI2(N315). Yet it is also missing two genes found in SaPI2(N315). Since many MGEs (phage, SaPI, plasmids) have immune functions preventing related elements from transferring to the host bacterium (25), it can be important to identify which type of MGE is carried. Recombination presents problems for identifying MGEs, and thus composite genomes were used. To rapidly identify conserved MGEs or MGE mosaic regions, we generated a list of the PCR products from the microarray that best corresponded to each individual sequenced gene in order. These MGE lists we call “composite genomes.” Composite genomes were constructed for all 50 MGEs and for the seven sequenced chromosomal genomes.
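Because a composite genome is simply an ordered list of PCR products, a screen of all composite genomes for runs of colocalized positive spots can be sketched as below. In this work the lists were inspected visually in GeneSpring; the run-length rule, the `min_run` threshold, and every identifier in the sketch are assumptions introduced purely for illustration.

```python
def longest_present_run(products, present):
    """Length of the longest stretch of consecutive products called present."""
    longest = run = 0
    for product in products:
        run = run + 1 if product in present else 0
        longest = max(longest, run)
    return longest


def screen_composites(composites, present, min_run=3):
    """Return {composite name: longest run} for composites meeting min_run.

    composites: dict mapping a composite genome name to its ordered product list
    present: set of PCR products called present in the test strain
    min_run: hypothetical threshold for reporting a colocalized block
    """
    flagged = {}
    for name, products in composites.items():
        run = longest_present_run(products, present)
        if run >= min_run:
            flagged[name] = run
    return flagged


# Toy composite genomes with made-up product identifiers.
composites = {
    "MGE_A": ["p1", "p2", "p3", "p4", "p5"],
    "MGE_B": ["p6", "p7", "p8"],
}
present = {"p2", "p3", "p4", "p7"}
flagged = screen_composites(composites, present)
print(flagged)
```

Scoring runs of adjacent products rather than a simple count reflects the point made in the text that colocalization of genes, and not only their presence, identifies a mosaic MGE fragment.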
We use the composite genomes to readily display the presence or absence of colocalized genes in MGEs between strains by microarray.

FIG. 4. (A) Microarray hybridization intensity scatter plot of S. aureus PM11 (test strain) compared to S. aureus MRSA252 (sequenced reference strain). Data presented are averages of four replicates, and only spots flagged by ImaGene as present or marginal in all four replicates are presented. The data points are colored according to whether the PCR products are predicted to be found in MRSA252 by BLAST (blue) or not (red). The green lines represent the twofold cutoff. (B) Hybridization data from the same experiment presented as composite genomes. The top half represents the experiment where MRSA252 is compared to itself, the bottom half where PM11 is compared to MRSA252. Each color block represents a PCR product and is colored according to the normalized ratio, with color intensity dependent on the signal for MRSA252 (by use of GeneSpring software). The entire MRSA252 genome (2,741 PCR products) shows little difference between the two isolates; only those genes found in MRSA252 are in the composite genome, and PM11 is not missing any of these genes. Three of the MGE composite genomes with substantial differences between MRSA252 and PM11 are shown. They are a fragment of the plasmid found in Mu50 (30 PCR products) including genes found on a putative transposable element carrying aminoglycoside resistance, a fragment of the φ1(Mu50) genome (68 PCR products), and a fragment of the SaPI2(N315) genome (14 PCR products) including the superantigen gene, tst. The figure is not drawn to scale.

Applications of multistrain composite genomes for comparative genomics. A clinical isolate of MRSA from St. George’s Hospital (strain PM11) (28) was compared to the reference strain MRSA252 by microarray analysis. Both strains belong to the epidemic MRSA-16 clonal group causing 50% of MRSA infections in the United Kingdom and found throughout the world, where it is also known as ST36, USA200. Figure 4A shows a scatter plot of PM11 compared to the reference strain. Spots are colored according to whether they are found in the reference strain sequence (blue) or not (red). Clearly, a group of genes is unique to PM11 and located above the twofold cutoff line, although there are no genes found in the reference strain that are missing in PM11. To visualize the unique PM11 genes more easily, composite genomes of the entire MRSA252 chromosome and three MGEs are displayed in Fig. 4B. As expected, the MRSA252 composite genome is virtually identical for the two isolates: all the genes found in MRSA252 are also found in PM11. Screening of the total of 50 MGE composite genomes identified 5 with positive signals. They are pMu50, φ1(Mu50), φ5(8325), SaPI2(Mu50), and SaPI2(N315). Composite genome results for pMu50, φ1(Mu50), and SaPI2(N315) are presented (Fig. 4B). The sequenced plasmid from Mu50 contains a section of 14 genes that appear to be part of a transposon carrying aminoglycoside resistance genes. This section fluoresces in PM11 but not in MRSA252, suggesting that PM11 has acquired this transposon, which may or may not be on a plasmid in this strain. Both sequenced phage φ1(Mu50) and φ5(8325) have a common conserved mosaic region of 28 genes. At least five of these genes are strongly present in PM11 but not in MRSA252, suggesting that PM11 carries a novel phage. The sequenced SaPI2(Mu50) and SaPI2(N315) share 12 common genes, and 7 of them are found in PM11 but not MRSA252. The genes include tst (encoding toxic shock syndrome toxin-1) and the SaPI2 integrase gene. This suggests that PM11 is carrying a novel SaPI2 element. Microarray analysis of six other EMRSA-16 isolates from St. George’s Hospital Medical School (SGHMS) (28) showed that they each carried unique assemblages of known MGEs and could all be distinguished from each other (data not shown).

A second example of the use of composite genomes involved comparing the laboratory strain 8325-4 (test strain) with the sequenced strain 8325 (reference strain) (Fig. 5A). 8325-4 was originally derived from 8325 (also known as RN1) and is presumed to be missing three prophages (30). This experiment was performed differently from those discussed in the rest of this paper in that the standard MRSA252 was not used as the reference strain, and the two test strains were compared directly to each other. The data are presented as a composite genome of 8325, and they clearly show that 8325-4 is missing three prophages, φ11, φ12, and φ13 [also known as φ5(8325), φ2(8325), and φ3(8325), respectively]. Scatter plots revealed that this strain was not carrying any genes that are not found in 8325 (data not shown). Only by using this composite genome approach can the colocation of the genes of the three phages be visualized. When we visualize an S. aureus genome, we intuitively take into account both the genes that are present and the order they

are in. This is because all S. aureus genomes sequenced so far are syntenic (have a conserved gene order), the exception being the integration or loss of large MGEs such as bacteriophage, SaPI, and integrated plasmids. Therefore, if we display the data on a composite genome for a strain that lacks particular MGEs, we will be unable to observe any differences in these MGEs. However, if we display the data on a composite genome that contains these particular MGEs, we will be able to visualize which of those MGEs are absent in our test isolate. Thus, by displaying the microarray data for 8325-4 on the composite genome of 8325, we can see it is missing exactly three MGEs.

A third application of the multistrain microarray composite genome approach is in the analysis of the acquisition of methicillin resistance genes, which convert S. aureus into a pathogen that is much more difficult to treat and is the cause of significant morbidity and mortality in hospitals. These genes have probably moved horizontally into S. aureus on at least five types of SCCmec element. Until now, identification of the SCCmec element type by PCR of multiple genes has been used to categorize MRSA isolates for epidemiological and evolutionary studies in the absence of more-discriminatory methods such as microarrays. In Fig. 5B, we show that microarrays are more than capable of identifying SCCmec and other SCC types by using composite genomes (here illustrated by the SCCmecII type). From genome sequence data we know that MRSA252, N315, and Mu50 all carry SCCmecII, and on the microarray all these isolates give a positive signal for the whole SCCmecII element. MW2 is known to carry SCCmecIV, and the only similarities between these elements are the mec and ccr genes; again, this can be seen using the composite genome display of the microarray hybridization profiles for these regions (Fig. 5B).
COL carries an unclassified SCCmec type most closely related to SCCmecI and does not carry any genes associated with SCCmecII apart from the mec genes themselves; this is


FIG. 5. (A) Microarray data for the S. aureus laboratory strain 8325-4 compared to the S. aureus parental strain 8325 and visualized using the 8325 composite genome. The colors are the same as for Fig. 4. Data presented are averages of four replicates. The obvious deletion regions colored blue are bacteriophage and are labeled accordingly. (B) Composite genome of the SCCmecII region from MRSA252, showing microarray hybridization data for the seven sequenced strains. Some genes of interest are marked, including the ble/kn (bleO, kanR), mec (mecARI), Tn554, and ccr genes (ccrAB). From the sequencing projects, MRSA252, Mu50, and N315 are all known to be SCCmecII positive; MW2 carries SCCmecIV, which shares only the mec and ccr genes; COL carries an unclassified SCC which has only the mec genes conserved; MSSA476 carries an SCC with no genes common to SCCmecII; and 8325 is SCC negative. The results of the microarray match the sequencing results, as expected. The colors are the same as for Fig. 2, although one gene produced a poor-quality signal and is colored gray in three strains (marked with an asterisk).

also visualized by cohybridization signals with some genes in the composite genome display. Neither MSSA476 (which carries an SCC element without mec and without homology to SCCmecII) nor 8325 (which is mec and SCC negative) shows a microarray signal for the SCCmecII composite genome. Mosaic sections of the SCC are marked on Fig. 5B, including the mecARI genes, involved in methicillin resistance; the conserved ccrAB genes in SCCmecII and IV, which are thought to be involved in horizontal transfer; bleomycin and kanamycin resistance genes; and transposon Tn554, carrying resistance to erythromycin. Similarly, composite genomes of other SCC types provided further information in support of SCCmec type classification (data not shown).

DISCUSSION

We have successfully designed, built, and validated a microarray representing every predicted gene from the seven S. aureus whole-genome sequencing projects. The microarray is extremely accurate at predicting gene differences between strains and is highly reproducible. The multistrain genome coverage of the microarray allows it to be used to investigate

the horizontal acquisition or loss of MGEs in related strains, and it is by far the most comprehensive method of identifying relationships between strains of S. aureus and subtle differences between clonal isolates. The design method was developed to produce an essentially nonredundant microarray such that most genes would be represented only once; this was possible because the core genome appears highly conserved in all S. aureus strains. The advantage of using a nonredundant microarray is that a single strain can be used to provide reference DNA. The choice of reference DNA is not straightforward for a multistrain microarray. Others have tried using ratios of chromosomal preparations of all strains in order to generate a control signal for every spot on the microarray. However, this leads to increased signal intensity of core gene spots in the reference channel and complicates data analysis based on ratios. Another option is to use a mixture of PCR products generated using the same primers as those used to construct the microarray, although this approach would be technically demanding and poorly reproducible. Our results suggest that using a chromosomal preparation of a single genome was appropriate.


Interpretation of data using a multistrain microarray is complex, and many research groups have struggled to identify exactly which genes are present or absent in a strain by using microarray data for other bacterial species. The major problem is genes that fall into the “marginal” zone, that is, somewhere between the “present” and “absent” regions. Some groups have chosen to ignore these data altogether. For the first time we have been able to independently assess the methods commonly used for categorizing genes as present or absent with a substantial number of sequenced genomes. A twofold cutoff method was shown to be more conservative than the GACK or 3SD method and so may miss many important differences. The GACK and 3SD methods give greater sensitivity at the risk of a higher number of false positives. The method of choice, therefore, would depend on the importance of these factors to the experimental design. The development of better methods for determining cutoffs is warranted.

Our results show the impact of replicates on microarray results and support the use of additional replicates to minimize false positives. However, because of the cost of microarrays, the choice of replicate numbers in any large-scale comparative genomics experiments will depend on balancing the increased cost for a high number of replicates against the disadvantage of an increased false-positive rate for a low number of replicates. Depending on the intended application, a single microarray slide would provide useful and reproducible data more suited to a global comparison of a large number of strains. For a more precise determination of presence or absence on a gene-by-gene basis, a higher number of replicates would be required to achieve a more acceptable false-positive rate.

The seven sequenced strains are thought to offer reasonable coverage of human clinical strains, and they include a wide range of MGEs. However, our microarray cannot include every possible S.
aureus gene. The “missing” genes are expected to be carried on MGEs. Using our composite genome approach, we can identify the presence of novel MGEs in unsequenced strains, which will enable targeted sequencing of these elements and even more comprehensive microarrays in the future. Using three biological examples of genome variation, we have proven the value of the seven-strain microarray and its advantages over a microarray based on a single genome. Microarrays are clearly a very valuable method for discriminating between related and unrelated isolates and can be used for typing isolates in epidemiological studies, for identifying genes associated with particular groups of isolates (such as virulent versus nonvirulent isolates), and for studying the horizontal transfer of MGEs and the evolution of S. aureus. In particular, we show here that “clonal” isolates of epidemic MRSA-16 can differ significantly in their carriage of MGEs. These MGEs can include genes encoding novel resistance or virulence factors, potentially impacting on treatment and pathogenesis. The frequency of MGE transfer and the stability of these genes are unknown, but these data support our previous studies suggesting that such rearrangements in S. aureus isolates in the hospital environment are common (27). Whether different isolates of the same clone are equally pathogenic remains to be determined. But the continuing accumulation of virulence and resistance genes in epidemic MRSA clones will lead to strains that are increasingly problematic in


the hospital and community environments (25) and should be monitored. In addition, the substantial variation between clonal isolates of MRSA could be exploited for epidemiological typing. Since many hospitals are endemically affected by only a few clonal types, an enhanced typing method will allow identification and tracking of outbreaks both within and between hospitals.

In summary, the construction of a seven-strain S. aureus microarray has presented unique challenges in design, use of reference DNA, interpretation, and visualization of data. Our results suggest that we have developed suitable methods, including the concept of composite genomes, that could be used to generate and use multistrain PCR product microarrays for other bacterial species, particularly those with substantial genome plasticity.

ACKNOWLEDGMENTS

We acknowledge The Wellcome Trust for funding the multicollaborative microbial pathogen microarray facility under its Functional Genomics Resources Initiative. The S. aureus microarray is available on a collaborative basis for other researchers to use through the Bacterial Microarray Group at SGHMS (BμG@S) (http://bugs.sghms.ac.uk). Preliminary sequence data for S. aureus strain COL were obtained from the website of The Institute for Genomic Research (http://www.tigr.org), and sequencing was accomplished with support from NIAID and MGRI. Preliminary sequence data for S. aureus strain 8325 were obtained from the Staphylococcus aureus Genome Sequencing Project at the University of Oklahoma website (www.genome.ou.edu/staph.html), and sequencing was accomplished with funding from the NIH and the Merck Genome Research Institute.

ADDENDUM IN PROOF

The annotated genome of S. aureus COL is now published (S. R. Gill, D. E. Fouts, G. L. Archer, E. F. Mongodin, R. T. Deboy, J. Ravel, I. T. Paulsen, J. F. Kolonay, L. Brinkac, M. Beanan, R. J. Dodson, S. C. Daugherty, R. Madupu, S. V. Angiuoli, A. S. Durkin, D. H. Haft, J. Vamathevan, H. Khouri, T. Utterback, C. Lee, G. Dimitrov, L. Jiang, H. Qin, J. Weidman, K. Tran, K. Kang, I. R. Hance, K. E. Nelson, and C. M. Fraser, J. Bacteriol. 187:2426–2438, 2005).

REFERENCES

1. Altschul, S. F., T. L. Madden, A. A. Schaffer, J. Zhang, Z. Zhang, W. Miller, and D. J. Lipman. 1997. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25:3389–3402.
2. Altschul, S. F., W. Gish, W. Miller, E. W. Myers, and D. J. Lipman. 1990. Basic Local Alignment Search Tool. J. Mol. Biol. 215:403–410.
3. Baba, T., F. Takeuchi, M. Kuroda, H. Yuzawa, K. Aoki, A. Oguchi, Y. Nagai, N. Iwama, K. Asano, T. Naimi, H. Kuroda, L. Cui, K. Yamamoto, and K. Hiramatsu. 2002. Genome and virulence determinants of high virulence community-acquired MRSA. Lancet 359:1819–1827.
4. Behr, M. A., M. A. Wilson, W. P. Gill, H. Salamon, G. K. Schoolnik, and P. M. Small. 1999. Comparative genomics of BCG vaccines by whole-genome DNA microarray. Science 284:1520–1523.
5. Bjorkholm, B., A. Lundin, A. Sillen, K. Guillemin, N. Salama, C. Rubio, J. I. Gordon, P. Falk, and L. Engstrand. 2001. Comparison of genetic divergence and fitness between two subclones of Helicobacter pylori. Infect. Immun. 69:7832–7838.
6. Chan, K., S. Baker, C. C. Kim, C. S. Detweiler, G. Dougan, and S. Falkow. 2003. Genomic comparison of Salmonella enterica serovars and Salmonella bongori by use of an S. enterica serovar typhimurium DNA microarray. J. Bacteriol. 185:553–563.
7. Chen, T., Y. Hosogi, K. Nishikawa, K. Abbey, R. D. Fleischmann, J. Walling, and M. J. Duncan. 2004. Comparative whole-genome analysis of virulent and avirulent strains of Porphyromonas gingivalis. J. Bacteriol. 186:5473–5479.
8. Cummings, C. A., M. M. Brinig, P. W. Lepp, S. van de Pas, and D. A. Relman. 2004. Bordetella species are distinguished by patterns of substantial gene loss and host adaptation. J. Bacteriol. 186:1484–1492.
9. Delcher, A. L., D. Harmon, S. Kasif, O. White, and S. L. Salzberg. 1999. Improved microbial gene identification with GLIMMER. Nucleic Acids Res. 27:4636–4641.
10. Dorrell, N., J. A. Mangan, K. G. Laing, J. Hinds, D. Linton, H. Al-Ghusein, B. G. Barrell, J. Parkhill, N. G. Stoker, A. V. Karlyshev, P. D. Butcher, and B. W. Wren. 2001. Whole genome comparison of Campylobacter jejuni human isolates using a low-cost microarray reveals extensive genetic diversity. Genome Res. 11:1706–1715.
11. Dunman, P. M., W. Mounts, F. McAleese, F. Immermann, D. Macapagal, E. Marsilio, L. McDougal, F. C. Tenover, P. A. Bradford, P. J. Petersen, S. J. Projan, and E. Murphy. 2004. Uses of Staphylococcus aureus GeneChips in genotyping and genetic composition analysis. J. Clin. Microbiol. 42:4275–4283.
12. Dziejman, M., E. Balon, D. Boyd, C. M. Fraser, J. F. Heidelberg, and J. J. Mekalanos. 2002. Comparative genomic analysis of Vibrio cholerae: genes that correlate with cholera endemic and pandemic disease. Proc. Natl. Acad. Sci. USA 99:1556–1561.
13. Fitzgerald, J. R., D. E. Sturdevant, S. M. Mackie, S. R. Gill, and J. M. Musser. 2001. Evolutionary genomics of Staphylococcus aureus: insights into the origin of methicillin-resistant strains and the toxic shock syndrome epidemic. Proc. Natl. Acad. Sci. USA 98:8821–8826.
14. Frishman, D., A. Mironov, H. W. Mewes, and M. Gelfand. 1998. Combining diverse evidence for gene recognition in completely sequenced bacterial genomes. Nucleic Acids Res. 26:2941–2947.
15. Fukiya, S., H. Mizoguchi, T. Tobe, and H. Mori. 2004. Extensive genomic diversity in pathogenic Escherichia coli and Shigella strains revealed by comparative genomic hybridization microarray. J. Bacteriol. 186:3911–3921.
16. Hinchliffe, S. J., K. E. Isherwood, R. A. Stabler, M. B. Prentice, A. Rakin, R. A. Nichols, P. C. F. Oyston, J. Hinds, R. W. Titball, and B. W. Wren. 2003. Application of DNA microarrays to study the evolutionary genomics of Yersinia pestis and Yersinia pseudotuberculosis. Genome Res. 13:2018–2029.
17. Holden, M. T. G., E. J. Feil, J. A. Lindsay, S. J. Peacock, N. P. Day, M. C. Enright, T. J. Foster, C. E. Moore, L. Hurst, R. Atkin, A. Barron, N. Bason, S. D. Bentley, C. Chillingworth, T. Chillingworth, C. Churcher, L. Clark, C. Corton, A. Cronin, J. Doggett, L. Dowd, T. Feltwell, Z. Hance, B. Harris, H. Hauser, S. Holroyd, K. Jagels, K. D. James, N. Lennard, A. Line, R. Mayes, S. Moule, K. Mungall, D. Ormond, M. A. Quail, E. Rabbinowitsch, K. Rutherford, M. Sanders, S. Sharp, M. Simmonds, K. Stevens, S. Whitehead, B. G. Barrell, B. G. Spratt, and J. Parkhill. 2004. Complete genomes of two clinical Staphylococcus aureus strains: evidence for the rapid evolution of virulence and drug resistance. Proc. Natl. Acad. Sci. USA 101:9786–9791.
18. Kato-Maeda, M., J. T. Rhee, T. R. Gingeras, H. Salamon, J. Drenkow, N. Smittipat, and P. M. Small. 2001. Comparing genomes within the species Mycobacterium tuberculosis. Genome Res. 11:547–554.
19. Kim, C. C., E. A. Joyce, K. Chan, and S. Falkow. 2002. Improved analytical methods for microarray-based genome-composition analysis. Genome Biol. 3:RESEARCH0065.
20. Kluytmans, J., A. van Belkum, and H. Verbrugh. 1997. Nasal carriage of Staphylococcus aureus: epidemiology, underlying mechanisms, and associated risks. Clin. Microbiol. Rev. 10:505–520.
21. Koide, T., P. A. Zaini, L. M. Moreira, R. Z. N. Vencio, A. Y. Matsukuma, A. M. Durham, D. C. Teixeira, H. El-Dorry, P. B. Monteiro, A. C. R. da Silva, S. Verjovski-Almeida, A. M. da Silva, and S. L. Gomes. 2004. DNA microarray-based genome comparison of a pathogenic and a nonpathogenic strain of Xylella fastidiosa delineates genes important for bacterial virulence. J. Bacteriol. 186:5442–5449.
22. Kuroda, M., T. Ohta, I. Uchiyama, T. Baba, H. Yuzawa, I. Kobayashi, L. Cui, A. Oguchi, K. Aoki, Y. Nagai, J. Lian, T. Ito, M. Kanamori, H. Matsumaru, A. Maruyama, H. Murakami, A. Hosoyama, Y. Mizutani-Ui, N. K. Takahashi, T. Sawano, R. Inoue, C. Kaito, K. Sekimizu, H. Hirakawa, S. Kuhara, S. Goto, J. Yabuzaki, M. Kanehisa, A. Yamashita, K. Oshima, K. Furuya, C. Yoshino, T. Shiba, M. Hattori, N. Ogasawara, H. Hayashi, and K. Hiramatsu. 2001. Whole genome sequencing of methicillin-resistant Staphylococcus aureus. Lancet 357:1225–1240.
23. Leonard, E. E., II, L. S. Tompkins, S. Falkow, and I. Nachamkin. 2004.

APPL. ENVIRON. MICROBIOL.

24. 25. 26. 27. 28. 29. 30. 31. 32. 33. 34. 35. 36. 37.

38.

39.

40.

41. 42.

Comparison of Campylobacter jejuni isolates implicated in Guillain-Barre syndrome and strains that cause enteritis by a DNA microarray. Infect. Immun. 72:1199–1203. Lindsay, J. A., A. Ruzin, H. F. Ross, N. Kurepina, and R. P. Novick. 1998. The gene for toxic shock toxin is carried by a family of mobile pathogenicity islands in Staphylococcus aureus. Mol. Microbiol. 29:527–543. Lindsay, J. A., and M. T. G. Holden. 2004. Staphylococcus aureus: superbug, super genome? Trends Microbiol. 12:378–385. Mei, J. M., F. Nourbakhsh, C. W. Ford, and D. W. Holden. 1997. Identification of Staphylococcus aureus virulence genes in a murine model of bacteraemia using signature-tagged mutagenesis. Mol. Microbiol. 26:399–407. Moore, P. C., and J. A. Lindsay. 2001. Genetic variation among hospital isolates of methicillin-sensitive Staphylococcus aureus: evidence for horizontal transfer of virulence genes. J. Clin. Microbiol. 39:2760–2767. Moore, P. C. L., and J. A. Lindsay. 2002. Molecular characterisation of the dominant UK methicillin-resistant Staphylococcus aureus strains, EMRSA-15 and EMRSA-16. J. Med. Microbiol. 51:516–521. Nakamura, Y., T. Gojobori, and T. Ikemura. 2000. Codon usage tabulated from international DNA sequence databases: status for the year 2000. Nucleic Acids Res. 28:292. Novick, R. P. 1967. Properties of a cryptic high-frequency transducing phage in Staphylococcus aureus. Virology 33:155–166. Pearson, W. R., and D. J. Lipman. 1988. Improved tools for biological sequence comparison. Proc. Natl. Acad. Sci. USA 85:2444–2448. Porwollik, S., J. Frye, L. D. Florea, F. Blackmer, and M. McClelland. 2003. A non-redundant microarray of genes for two related bacteria. Nucleic Acids Res. 31:1869–1876. Porwollik, S., R. M.-Y. Wong, and M. McClelland. 2002. Evolutionary genomics of Salmonella: gene acquisitions revealed by microarray analysis. Proc. Natl. Acad. Sci. USA 99:8956–8961. Rozen, S., and H. Skaletsky. 2000. 
Primer3 on the WWW for general users and for biologist programmers. Methods Mol. Biol. 132:365–386. (Code available at http://fokker.wi.mit.edu/primer3/.) Rutherford, K., J. Parkhill, J. Crook, T. Horsnell, P. Rice, M. A. Rajandream, and B. Barrell. 2000. Artemis: sequence visualization and annotation. Bioinformatics 16:944–945. Salama, N., K. Guillemin, T. K. McDaniel, G. Sherlock, L. Tompkins, and S. Falkow. 2000. A whole-genome microarray reveals genetic diversity among Helicobacter pylori strains. Proc. Natl. Acad. Sci. USA 97:14668–14673. Saunders, N. A., A. Underwood, A. M. Kearns, and G. Hallas. 2004. A virulence-associated gene microarray: a tool for investigation of the evolution and pathogenic potential of Staphylococcus aureus. Microbiology 150: 3763–3771. Smoot, J. C., K. D. Barbian, J. J. Van Gompel, L. M. Smoot, M. S. Chaussee, G. L. Sylva, D. E. Sturdevant, S. M. Riclefs, S. F. Porcella, L. D. Parkins, S. B. Beres, D. S. Campbell, T. M. Smith, Q. Zhang, V. Kapur, J. A. Daly, L. G. Veasy, and J. M. Musser. 2002. Genome sequence and comparative microarray analysis of serotype M18 group A Streptococcus strains associated with acute rheumatic fever outbreaks. Proc. Natl. Acad. Sci. USA 99:4668– 4673. Snyder, L. A. S., J. K. Davies, and N. J. Saunders. 2004. Microarray genomotyping of key experimental strains of Neisseria gonorrhoeae reveals gene complement diversity and five new neisserial genes associated with minimal mobile elements. BMC Genomics 5:23. Thomson, N., S. Baker, D. Pickard, M. Fookes, M. Anjum, N. Hamlin, J. Wain, D. House, Z. Bhutta, K. Chan, S. Falkow, J. Parkhill, M. Woodward, A. Ivens, and G. Dougan. 2004. The role of prophage-like elements in the diversity of Salmonella enterica serovars. J. Mol. Biol. 339:279–300. Wernisch, L., S. L. Kendall, S. Soneji, A. Wietzorrek, T. Parish, J. Hinds, P. D. Butcher, and N. G. Stoker. 2003. Analysis of whole-genome microarray replicates using mixed models. Bioinformatics 19:53–61. Wolfgang, M. 
C., B. R. Kulasekara, X. Liang, D. Boyd, K. Wu, Q. Yang, C. G. Miyada, and S. Lory. 2003. Conservation of genome content and virulence determinants among clinical and environmental isolates of Pseudomonas aeruginosa. Proc. Natl. Acad. Sci. USA 100:8484–8489.