Short Technical Reports

High correspondence between Affymetrix exon and standard expression arrays

Michał J. Okoniewski, Yvonne Hey, Stuart D. Pepper, and Crispin J. Miller
The Paterson Institute for Cancer Research, Manchester, UK

BioTechniques 42:181-185 (February 2007)
doi 10.2144/000112315

Exon arrays aim to provide comprehensive gene expression data at the level of individual exons, similar to that provided on a per-gene basis by existing expression arrays. This report describes the performance of the Affymetrix GeneChip® Human Exon 1.0 ST array, using replicated RNA samples from two human cell lines, MCF7 and MCF10A, hybridized both to Exon 1.0 ST and to HG-U133 Plus2 arrays. Cross-comparison between array types requires an appropriate mapping to be found between individual probe sets. Three possible mappings were considered, reflecting different strategies for dealing with probe sets that target different parts of the same transcript. Irrespective of the mapping used, Exon 1.0 ST and HG-U133 Plus2 arrays show a high degree of correspondence. More than 80% of HG-U133 Plus2 probe sets may be mapped to the Exon chip, and fold changes are well preserved for over 96% of the probe sets detected as present. Since HG-U133 Plus2 arrays have already been extensively validated, these results lend a significant degree of confidence to exon arrays.

INTRODUCTION

The GeneChip® Human Exon 1.0 ST array contains approximately 5.5 million probes, forming 1.4 million probe sets that are together used to separately interrogate 1 million known and predicted exons; the aim is to comprehensively cover the entire human genome at the exon scale. Exon arrays offer a more fine-grained view of gene expression than the current generation of chips and have the potential to support global inferences about gene expression at the level of individual isoforms and exons, rather than on the per-gene basis offered with existing approaches. Exon arrays are probably the most radically changed generation of GeneChip microarrays and promise to be a powerful technology, given that a significant proportion of human genes are predicted to be differentially spliced (1–3). In particular, 74% of multi-exon genes are estimated to be alternatively spliced (2).

Such a high target density has been achieved through a variety of changes to the hardware platform, to the design of the array itself, and to the chemistry used to prepare the samples for hybridization. Thus, not only are there about six times as many features as on the previous generation of chips, but the probe set count has been further increased by no longer having a paired mismatch probe for each perfect match partner and by reducing the number of oligonucleotide probes per probe set from 11 to 4 (4).

These changes in array design also make many of the existing data analysis methods obsolete. It is not possible, for example, to use either the MAS5 expression summary or detection calling (5,6) algorithms, since there are no paired mismatch probes; instead Affymetrix provides a new algorithm, probe logarithmic intensity error (plier), and a new method, detection above background (DABG), for assessing the reliability of each probe set (7).

With so many changes to the underlying technology, it is important to develop an understanding of exon arrays similar to that which has been developed for current chips. These have been comprehensively explored using controlled data sets (8,9), and the literature contains numerous studies in which candidate genes have been validated using alternate approaches, such as quantitative PCR or protein expression, and in which hypotheses generated from microarray data have been successfully pursued through to a biological conclusion. There is, therefore, a sizable body of data confirming the validity of Affymetrix microarray data (10–13). If a significant degree of mutual consistency is found between exon and standard expression arrays, this would provide significant evidence in favor of their reliability. Thus, the purpose of this report is to consider, using replicated data from two cell lines, MCF7 and MCF10A, the levels of reproducibility between HG-U133 Plus2 human microarrays and Exon 1.0 ST arrays.

Fundamental to the analysis is the need to analyze the available mappings between the probe sets on the different chips. Since one of the prime motivations behind the development of exon arrays is that different parts of a gene can be expressed in different ways in different samples, any comparison must consider the exact location of individual probe sets relative to the target genes’ structure. Thus the success or failure of any such analysis is likely to be at least in part governed by the annotation used to define the mapping (Figure 1). This is not straightforward, because the array structure is sufficiently complex that there is no unique and universal one-to-one mapping between probe sets. This paper explores three possibilities: the consensus and target (or SIF) mappings, both supplied by Affymetrix, and an alternate approach (referred to here as the Chip Definition File, or CDF, mapping) based on the National Center for Biotechnology Information (NCBI) Reference Sequence (RefSeq) transcripts (14).

A short description of the different array design strategies (15,16) is necessary in order to identify their key differences.
Briefly, probe sets on the Plus2 array were designed against a variety of public resources including UniGene (17), GenBank® (18), the expressed sequence tags database (dbEST) (19), and RefSeq (20). Part of the annotation process involved generating clusters of sequences around each gene and computing alignments within these clusters. Choices have to be made where, for example, sequences within a cluster are of different lengths or quality, or there are discrepancies in the residue called at a particular point. Clusters are thus represented by consensus or exemplar sequences that reflect the end result of these decisions. These sequences are typically long (often full-length messenger RNA or mRNA) and require additional constraints, such as the proximity of a poly(A) site, to select appropriate regions against which to design each probe set. These shorter probe selection regions are known as SIF or target sequences.

By contrast, exon arrays were designed by selecting a diverse set of genomic annotations, including primary sequence and gene prediction sets, and projecting these data onto the assembled genome. The resultant sequences were grouped and used to infer exon, intron, and intergenic regions that were subsequently refined further to select appropriate target sequences against which to design probes. There are also significantly fewer probe sets on a Plus2 array, and the majority of these are targeted toward the 3′ end of the transcript.

Affymetrix supplies two possible cross-chip mappings based on the consensus and SIF sequences used to generate the Plus2 array. In the SIF mapping file, a correspondence is recorded when a probe set from each array is fully included within the same SIF sequence; a similar approach is also used to build mappings via the consensus sequences. A number of alternate annotation strategies have also been developed (14,21), in part because the continual growth of the public domain databases has resulted in a progressive maturation of the sequence data upon which annotation is based and in part because different annotations can be used to emphasize different features in the data. Thus, we also consider the RefSeq mappings (14), as RefSeq is known to be a nonredundant and relatively complete database of transcripts (20). Here, all probes, irrespective of the original probe set of which they are a member, are brought together to form alternative probe sets based on the RefSeq transcripts they match in an in silico search.
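The regrouping of probes by matched RefSeq transcript can be illustrated with a minimal Python sketch. It is not the procedure used to build the CDF environments of reference 14; the input format (pairs of probe and transcript identifiers from a sequence search), the function names, and the identifiers in the example are all hypothetical.

```python
from collections import defaultdict


def regroup_probes(probe_hits):
    """Group probe IDs into alternative probe sets keyed by transcript ID.

    probe_hits is an iterable of (probe_id, transcript_id) pairs, one per
    match reported by a sequence search (hypothetical input format).
    The original probe set membership of each probe is ignored.
    """
    probe_sets = defaultdict(set)
    for probe_id, transcript_id in probe_hits:
        probe_sets[transcript_id].add(probe_id)
    return {t: sorted(p) for t, p in probe_sets.items()}


def shared_transcripts(plus2_sets, exon_sets):
    """Transcripts that received at least one probe on both array types."""
    return sorted(plus2_sets.keys() & exon_sets.keys())


# Toy search results; all identifiers are invented for illustration.
plus2_hits = [("p133_1", "NM_A"), ("p133_2", "NM_A"), ("p133_3", "NM_B")]
exon_hits = [("ex_1", "NM_A"), ("ex_2", "NM_C"), ("ex_3", "NM_A")]

plus2_sets = regroup_probes(plus2_hits)
exon_sets = regroup_probes(exon_hits)
common = shared_transcripts(plus2_sets, exon_sets)
print(common)
```

In this toy case only the transcript matched by probes from both arrays yields a cross-chip correspondence; transcripts matched on one array only drop out of the comparison.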

[Figure 1: schematic. Labels: HG-U133 Plus2 probe sets; Exon 1.0 ST probe sets; consensus sequence; transcript structure (5′ end marked); consensus mapping; SIF mapping; CDF mapping. Legend: probe set, 133 probe, Exon probe.]

Figure 1. Relationship between probes, probe sets, and transcripts. The additional complexity of exon arrays means that there is not a clear one-to-one mapping between probe sets on the two arrays. Three alternate mappings (consensus, target or SIF, and Chip Definition File or CDF) are considered, as described in the text.
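The SIF rule described in the text (a correspondence is recorded when a probe set from each array is fully included within the same SIF sequence) amounts to an interval containment test. The sketch below is one possible reading of that rule and is not the Affymetrix implementation; the coordinate representation, the Region type, and all names and coordinates are assumptions made for illustration.

```python
from typing import NamedTuple


class Region(NamedTuple):
    """A named interval on a reference sequence (hypothetical representation)."""
    name: str
    seq: str
    start: int
    end: int


def contains(outer, inner):
    """True if inner lies entirely within outer on the same sequence."""
    return (outer.seq == inner.seq
            and outer.start <= inner.start
            and inner.end <= outer.end)


def sif_mapping(sifs, plus2_sets, exon_sets):
    """Pair Plus2 and Exon probe sets that fall inside the same SIF region."""
    pairs = []
    for sif in sifs:
        plus2_inside = [p for p in plus2_sets if contains(sif, p)]
        exon_inside = [e for e in exon_sets if contains(sif, e)]
        pairs.extend((p.name, e.name)
                     for p in plus2_inside for e in exon_inside)
    return pairs


# Invented coordinates: exon_2 extends past the end of sif_1.
sifs = [Region("sif_1", "chrT", 100, 700)]
plus2 = [Region("plus2_a", "chrT", 150, 650)]
exons = [Region("exon_1", "chrT", 200, 320),
         Region("exon_2", "chrT", 680, 760)]

pairs = sif_mapping(sifs, plus2, exons)
print(pairs)
```

Because containment must be complete, an exon probe set that only partly overlaps the SIF region is excluded, which is why this mapping is the most conservative of the three.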

MATERIALS AND METHODS

RNA Preparation

RNA from MCF7 and MCF10A samples was labeled and hybridized in triplicate (technical replicates) to HG-U133 Plus2 arrays and Human Exon 1.0 ST arrays (both from Affymetrix, Santa Clara, CA, USA). MCF7 cells were grown in Dulbecco’s modified Eagle’s medium (DMEM) with 10% (v/v) fetal calf serum (FCS; Invitrogen, Carlsbad, CA, USA), and MCF10A cells were grown in DMEM/F12 with 5% (v/v) horse serum (Invitrogen, Paisley, UK), 2 ng/mL epidermal growth factor (PeproTech, Rocky Hill, NJ, USA), 0.5 μg/mL hydrocortisone, 0.5 μg/mL cholera toxin, and 5 μg/mL insulin (Sigma, St. Louis, MO, USA). Full descriptions of the hybridization protocols for both Plus2 and Exon arrays can be found in the supplementary material available online at www.BioTechniques.com.

Initial Data Processing

For the two Affymetrix mappings (SIF and consensus), data from the exon array were processed using the Exon Array Computational Tool (ExACT™; Affymetrix) implementation of plier (normalization by group: cell line, quantile method, summarization: standard pgf and clf files, antigenomic background, PlierDifferentialTargetPenalty = 0.1, consensus and SIF metaprobe sets files, respectively) prior to being loaded and analyzed in R and BioConductor (22–24). HG-U133 Plus2 arrays were processed with the plier BioConductor library (PM probes only, quantile normalization and concpenalty = 0.1). Detection calls (25) were calculated in BioConductor for the HG-U133 Plus2 arrays; DABG values from ExACT were used for the Exon arrays. The CDF mapping (14) uses probe set definitions based on a de novo mapping against RefSeq transcripts.

[Figure 2: three scatter plots, panels A, B, and C. x-axis, HG-U133 Plus2; y-axis, Exon (consensus, SIF, and CDF mappings).]

Figure 2. Fold changes between MCF7 and MCF10A samples found using each array type. x-axis, Plus2 arrays; y-axis, Exon arrays. Each point corresponds to a successful mapping between probe sets on each array type. (A) Consensus mapping, (B) target or SIF mapping, and (C) Chip Definition File (CDF) mapping. Black squares, probe sets consistently flagged Present on both sets of arrays. Grey crosses, probe sets flagged Absent on one or more arrays. The diagonal lines represent log2(2), log2(4), log2(6), log2(8) differences in the fold change.

These are stored within alternative CDF environments. The data were first loaded into BioConductor, and the CDF environments (hs133phsrefseq7cdf and hsex10stv2hsrefseq7cdf for the Plus2 and Exon arrays, respectively) were substituted (21) before the data were processed.

Fold Changes, Significant Probe Sets, Detection Filtering

Two strategies were adopted. In the first, shown in Figure 2, fold changes between the log2 mean values for the MCF7 and MCF10A replicates were calculated independently for the Exon and Plus2 arrays. Each point corresponds to a pair of probe sets for which a successful cross-chip mapping could be found. Ideally, log fold changes (called fold changes throughout the paper) would be identical, and all points would fall on the major diagonal of each scatter plot. The percentage of probe sets for which the difference between fold changes on the two platforms is smaller than 1 (on a log scale), referred to here as corresponding probe sets, provides a metric for the overall quality of correspondence between array types. Reliability calls were used to partition the data. For the two Affymetrix mappings, DABG scores were produced for the Exon chips and standard detection calls for the Plus2 arrays. For the CDF mapping, DABG scores were unavailable, and detection calls could not be calculated for the Exon arrays due to the unavailability of mismatch (MM) probes. Thus, filtering was performed solely on the HG-U133 Plus2 data.

The second approach used significance analysis of microarrays (SAM), empirical Bayes analyses of microarrays (EBAM), and empirical Bayes (eBayes) (as implemented in BioConductor by the siggenes and limma packages), popular methods for selecting genes based on significance (26–28). For all three cross-chip mappings, the Δ parameter of SAM was selected to generate gene lists of the same size (100 and 1000 top probe sets were used for comparisons). Given the relatively small number of replicates in the experiment, the estimated false discovery rate (FDR) for each of these sets is unlikely to be reliable and is not reported. Table 1 shows the number of overlapping probe sets found between the gene lists generated for each array type. Results using EBAM and eBayes are similar.
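The fold change comparison and the percentage of corresponding probe sets can be sketched as follows. This is an illustrative Python example rather than the authors' R code. It assumes that "log2 mean values" means the log2 of the mean linear-scale expression across replicates, and the expression values, the present flag, and the function names are invented for the example.

```python
import math
from statistics import mean


def log2_fold_change(group_a, group_b):
    """Difference of log2 mean expression between two replicate groups.

    Assumes positive, linear-scale expression values.
    """
    return math.log2(mean(group_a)) - math.log2(mean(group_b))


def percent_corresponding(fc_pairs, max_diff=1.0):
    """Percentage of (plus2_fc, exon_fc) pairs differing by less than max_diff."""
    if not fc_pairs:
        raise ValueError("no mapped probe set pairs supplied")
    hits = sum(1 for plus2_fc, exon_fc in fc_pairs
               if abs(plus2_fc - exon_fc) < max_diff)
    return 100.0 * hits / len(fc_pairs)


# Each record is one mapped pair: (MCF7 replicates, MCF10A replicates) per
# platform, plus a flag for whether it was called Present on all arrays.
records = [
    {"plus2": ([800, 820, 780], [200, 210, 190]),
     "exon": ([600, 590, 610], [200, 205, 195]), "present": True},
    {"plus2": ([100, 110, 90], [400, 420, 380]),
     "exon": ([150, 160, 140], [160, 150, 140]), "present": False},
    {"plus2": ([50, 55, 45], [50, 45, 55]),
     "exon": ([80, 85, 75], [60, 65, 55]), "present": True},
]

fc_all = [(log2_fold_change(*r["plus2"]), log2_fold_change(*r["exon"]))
          for r in records]
fc_present = [fc for fc, r in zip(fc_all, records) if r["present"]]

pct_all = percent_corresponding(fc_all)
pct_present = percent_corresponding(fc_present)
print(round(pct_all, 1), round(pct_present, 1))
```

Restricting the calculation to pairs flagged present mirrors the Filtering column of Table 1, where the consistent percentage rises once unreliable probe sets are removed.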

Table 1. Summary of Array Comparisons

Mapping    Filtering  Probe Sets  SAM Common Probe Sets  Consistent  Consistent (%)
Consensus  No         44280       528                    40683       91.9
Consensus  Pexon      27723       568                    25657       92.5
Consensus  P133       15965       732                    15497       97.1
Consensus  Both       14657       745                    14345       97.9
SIF        No         13730       524                    12720       92.6
SIF        Pexon      5695        633                    5400        94.8
SIF        P133       4242        662                    4057        95.6
SIF        Both       3516        714                    3429        97.5
CDF        No         26405       581                    24692       93.5
CDF        P133       12481       729                    12101       96.9
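The SAM Common Probe Sets column of Table 1 counts the probe sets shared between equally sized gene lists drawn from each platform. The Python sketch below shows only the overlap step. It ranks by the absolute value of an arbitrary per-probe-set statistic as a stand-in, and does not implement SAM or its Δ selection; the scores and identifiers are invented.

```python
def top_n(scores, n):
    """IDs of the n entries with the largest absolute score."""
    ranked = sorted(scores, key=lambda key: abs(scores[key]), reverse=True)
    return set(ranked[:n])


def list_overlap(scores_a, scores_b, n):
    """Number of IDs present in the top-n lists of both platforms.

    Both dictionaries are keyed by the same mapped-pair ID, so the
    comparison is restricted to probe sets with a cross-chip mapping.
    """
    return len(top_n(scores_a, n) & top_n(scores_b, n))


# Invented test statistics for four mapped probe set pairs.
plus2_scores = {"m1": 9.1, "m2": -7.4, "m3": 0.3, "m4": 2.2}
exon_scores = {"m1": 6.0, "m2": -0.5, "m3": 0.1, "m4": -5.2}

shared = list_overlap(plus2_scores, exon_scores, n=2)
print(shared, 100.0 * shared / 2)
```

With lists of 1000 probe sets, as in Table 1, a count of 745 would correspond to an overlap of 74.5%.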

Significance analysis of microarrays (SAM) and fold change calculations were performed on the Plus2 and Exon arrays independently. The column SAM Common Probe Sets gives the number of probe sets shared between the two gene lists. These data were generated for the subset of probe sets successfully mapped between arrays using each of the three cross-chip mappings (consensus, target or SIF, and Chip Definition File or CDF), described in the text. The mapping is specified by the column Mapping. Data were also filtered by detection call (Filtering). Four possible filterings were considered: No, no filtering; Pexon, present on all six Exon arrays; P133, present on all six Plus2 arrays; Both, present on all arrays. Probe Sets reports the number of probe sets remaining following filtering. Consistent and Consistent (%) show the number and percentage of probe sets that have […]

[…] 70% overlap is found between platforms for all three mappings, comparable to that found between the different expression summary algorithms. Note that the estimated FDR was not considered to be accurate given the relatively small number of replicates in the experiment (but ranged from 3.9% to 6.8%) and is thus not reported in Table 1. Similar results were also found when the analysis was performed using the top 100, rather than the top 1000, probe sets and when EBAM and eBayes were used instead to identify significant genes (see Supplementary Tables S1–S3).

A related data set, generated using human tissue samples (provided by Affymetrix; downloadable from the support section of their web site), is also available. The same analyses were performed on these data (see Supplementary Figures S1 and S2). Although more variable, probably reflecting the increased variation expected from tissue samples as compared with cell line data, correspondence was still good.

Three alternative cross-chip mappings were considered. The most conservative, via SIF sequences, only records a mapping when an exon probe set targets the same short sequence as a Plus2 probe set (see Figure 1), while the consensus and CDF mappings include probe sets along the full length of a transcript. Interestingly, given the differences in mapping strategies, all three approaches performed well, with very little quantitative difference (for the consistently Present probe sets) in the proportion of outliers, the FDR, or the overlap between gene lists. One major difference between mappings is the number of mapped probe sets, with the consensus mapping containing approximately three times as many correspondences as the SIF one.

When the less consistent data (i.e., the gray points in Figure 2) are considered, more variation is seen for the consensus and CDF mappings. This is to be expected, given that these probe sets offer summaries across the full length of the transcript; many transcripts will exhibit differential expression along their length, and probe sets can also contain flanking regions, untranslated regions (UTRs), and possibly incorrectly predicted exons. Thus, the implication is that much of the variability seen in these data (and correctly identified as variable by the DABG calling strategy) is due to real biology and is a consequence of summarizing a complex set of data into a single consensus value for the entire transcript. This would be of concern if these mappings were the only option available for analyzing the data, but, of course, this is not the case. The standard probe set definitions for exon arrays (rather than the cross-chip mappings described here) offer a set of probe sets targeting individual exons, and it is these definitions that provide an appropriate level of granularity.

The high degree of correspondence seen between Exon 1.0 ST and HG-U133 Plus2 arrays for probe sets with consistent signal strength, mapping to the same genomic region, allied with the fact that the current generation of expression arrays has already been repeatedly validated experimentally, provides strong evidence supporting the reliability of exon arrays, not only for probe sets that can be successfully mapped to the existing arrays, but also for the many thousands of additional probe sets that provide more detailed coverage of the transcriptome.

ACKNOWLEDGMENTS

This work was funded by Cancer Research UK. Carla Möller-Levet took part in some highly informative discussions, and Rob Clarke provided RNA samples for hybridization.

COMPETING INTERESTS STATEMENT

The authors declare no competing interests.

REFERENCES

1. Mironov, A.A., J.W. Fickett, and M.S. Gelfand. 1999. Frequent alternative splicing of human genes. Genome Res. 9:1288-1293.
2. Johnson, J.M., J. Castle, P. Garrett-Engele, Z. Kan, P.M. Loerch, C.D. Armour, R. Santos, E.E. Schadt, et al. 2003. Genome-wide survey of human alternative pre-mRNA splicing with exon junction microarrays. Science 302:2141-2144.
3. Shai, O., Q.D. Morris, B.J. Blencowe, and B.J. Frey. 2006. Inferring global levels of alternative splicing isoforms using a generative model of microarray data. Bioinformatics 22:606-613.
4. Affymetrix. 2005. Exon probe set annotations and transcript cluster groupings. Affymetrix, Santa Clara, CA.
5. Hubbell, E., W.-M. Liu, and R. Mei. 2002. Robust estimators for expression analysis. Bioinformatics 18:1585-1592.
6. Liu, W.-M., R. Mei, X. Di, T.B. Ryder, E. Hubbell, S. Dee, T.A. Webster, C.A. Harrington, et al. 2002. Analysis of high density expression microarrays with signed-rank call algorithms. Bioinformatics 18:1593-1599.
7. Affymetrix. 2005. Guide to probe logarithmic intensity error (plier) estimation. Technical note. Affymetrix, Santa Clara, CA.
8. Held, G.A., G. Grinstein, and Y. Tu. 2006. Relationship between gene expression and observed intensities in DNA microarrays—a modeling study. Nucleic Acids Res. 34:e70.
9. Irizarry, R.A., Z. Wu, and H.A. Jaffee. 2006. Comparison of Affymetrix GeneChip expression measures. Bioinformatics 22:789-794.
10. Nimgaonkar, A., D. Sanoudou, A. Butte, J. Haslett, L. Kunkel, A. Beggs, and I. Kohane. 2003. Reproducibility of gene expression across generations of Affymetrix microarrays. BMC Bioinformatics 4:27.
11. Hwang, K.-B., S. Kong, S. Greenberg, and P. Park. 2004. Combining gene expression data from different generations of oligonucleotide arrays. BMC Bioinformatics 5:159.
12. Elo, L.L., L. Lahti, H. Skottman, M. Kylaniemi, R. Lahesmaa, and T. Aittokallio. 2005. Integrating probe-level expression changes across generations of Affymetrix arrays. Nucleic Acids Res. 33:e193.
13. Carter, S., A. Eklund, B. Mecham, I. Kohane, and Z. Szallasi. 2005. Redefinition of Affymetrix probe sets by sequence overlap with cDNA microarray probes reduces cross-platform inconsistencies in cancer-associated gene expression measurements. BMC Bioinformatics 6:107.
14. Dai, M., P. Wang, A.D. Boyd, G. Kostov, B. Athey, E.G. Jones, W.E. Bunney, R.M. Myers, et al. 2005. Evolving gene/transcript definitions significantly alter the interpretation of GeneChip data. Nucleic Acids Res. 33:e175.
15. Affymetrix. 2001. Array design for the GeneChip Human Genome U133 set. Technical note. Affymetrix, Santa Clara, CA.
16. Affymetrix. 2005. GeneChip exon array design. Technical note. Affymetrix, Santa Clara, CA.
17. Wheeler, D.L., D.M. Church, R. Edgar, S. Federhen, W. Helmberg, T.L. Madden, J.U. Pontius, G.D. Schuler, et al. 2004. Database resources of the National Center for Biotechnology Information: update. Nucleic Acids Res. 32(suppl 1):D35-D40.
18. Benson, D.A., I. Karsch-Mizrachi, D.J. Lipman, J. Ostell, and D.L. Wheeler. 2005. GenBank. Nucleic Acids Res. 33(suppl 1):D34-D38.
19. Boguski, M.S., T.M. Lowe, and C.M. Tolstoshev. 1993. dbEST—database for “expressed sequence tags”. Nat. Genet. 4:332-333.
20. Pruitt, K.D., T. Tatusova, and D.R. Maglott. 2005. NCBI reference sequence (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res. 33(suppl 1):D501-D504.
21. Gautier, L., M. Moller, L. Friis-Hansen, and S. Knudsen. 2004. Alternative mapping of probes to genes for Affymetrix chips. BMC Bioinformatics 5:111.
22. Gentleman, R., V. Carey, D. Bates, B. Bolstad, M. Dettling, S. Dudoit, B. Ellis, L. Gautier, et al. 2004. Bioconductor: open software development for computational biology and bioinformatics. Genome Biol. 5:R80.
23. Gautier, L., L. Cope, B.M. Bolstad, and R.A. Irizarry. 2004. affy—analysis of Affymetrix GeneChip data at the probe level. Bioinformatics 20:307-315.
24. Wilson, C.L. and C.J. Miller. 2005. Simpleaffy: a bioconductor package for Affymetrix quality control and data analysis. Bioinformatics 21:3683-3685.
25. Affymetrix. 2002. Statistical algorithms description document. Affymetrix, Santa Clara, CA.
26. Tusher, V.G., R. Tibshirani, and G. Chu. 2001. Significance analysis of microarrays applied to the ionizing radiation response. Proc. Natl. Acad. Sci. USA 98:5116-5121.
27. Efron, B., R. Tibshirani, J.D. Storey, and V. Tusher. 2001. Empirical Bayes analysis of a microarray experiment. J. Am. Stat. Assoc. 96:1151-1160.
28. Smyth, G.K. 2004. Linear models and empirical Bayes methods for assessing differential expression in microarray experiments. Stat. Appl. Genet. Mol. Biol. 3:Article 3.

Received 18 August 2006; accepted 2 October 2006.

Address correspondence to Michał J. Okoniewski, Bioinformatics Group, Cancer Research UK, The Paterson Institute for Cancer Research, The University of Manchester, Christie Hospital Site, Wilmslow Road, M20 4BX, Manchester, UK.
