Evolving gene/transcript definitions significantly alter ...

Nucleic Acids Research, 2005, Vol. 33, No. 20 e175 doi:10.1093/nar/gni179

Evolving gene/transcript definitions significantly alter the interpretation of GeneChip data Manhong Dai, Pinglang Wang, Andrew D. Boyd1, Georgi Kostov1, Brian Athey1, Edward G. Jones2, William E. Bunney3, Richard M. Myers4, Terry P. Speed5, 5 Huda Akil, Stanley J. Watson and Fan Meng* Molecular and Behavioural Neuroscience Institute and Department of Psychiatry, University of Michigan, Ann Arbor, MI 48109, USA, 1Michigan Center for Biological Information, University of Michigan, Ann Arbor, MI 48105, USA, 2Department of Psychiatry and Center for Neuroscience, University of California, Davis, CA 95616, USA, 3Department of Psychiatry and Human Behavior, University of California, 4 10 Irvine, CA 92697, USA, Department of Genetics, Stanford University School of Medicine, Stanford, CA 94305, USA and 5Department of Statistics, University of California, Berkeley, CA 94720, USA Received September 26, 2005; Revised October 17, 2005; Accepted October 25, 2005

ABSTRACT Genome-wide expression profiling is a powerful 15 tool for implicating novel gene ensembles in cellular mechanisms of health and disease. The most popular platform for genome-wide expression profiling is the Affymetrix GeneChip. However, its selection of probes relied on earlier genome and transcriptome 20 annotation which is significantly different from current knowledge. The resultant informatics problems have a profound impact on analysis and interpretation the data. Here, we address these critical issues and offer a solution. We identified several classes of 25 problems at the individual probe level in the existing annotation, under the assumption that current genome and transcriptome databases are more accurate than those used for GeneChip design. We then reorganized probes on more than a dozen popular 30 GeneChips into gene-, transcript- and exon-specific probe sets in light of up-to-date genome, cDNA/EST clustering and single nucleotide polymorphism information. Comparing analysis results between the original and the redefined probe sets reveals 30–50% 35 discrepancy in the genes previously identified as differentially expressed, regardless of analysis method. Our results demonstrate that the original Affymetrix probe set definitions are inaccurate, and many conclusions derived from past GeneChip analyses 40 may be significantly flawed. It will be beneficial to re-analyze existing GeneChip data with updated probe set definitions.

INTRODUCTION While extensive attention has been devoted to improving the accuracy and sensitivity of the statistical algorithms used 45 to estimate gene expression levels and to detect differential expression in GeneChip-based expression analyses (1–4), problems related to probe and probe set identity likely lead to significant errors, especially under conditions where expression changes are not dramatic. GeneChips for expression 50 analysis use probe sets containing 11–20 pairs of 25mer oligonucleotides to represent a target gene or transcript. Each oligonucleotide pair consists of an oligo with perfect match to a target sequence region (PM probe) and another oligo with a single base mismatch in the center of the oligo (MM probe) 55 to the same target region. Although Affymetrix utilized the most complete information available at the time of GeneChip design, tremendous progress in genome sequencing and annotation in recent years renders existing GeneChip probe set designs suboptimal. For example, when the HG-U133 chip 60 set was designed, the human UniGene Build 133 contained 2.8 million cDNA/EST sequences and the human genome sequence was only 25% complete (5). Currently, the human UniGene builds contain over 5 million sequences and the human genome build 35 has 99% of the euchromatic portion 65 of the genome sequenced (6). Our analysis indicates that many of the old probe sets do not faithfully reflect the expression levels of a significant number of genes in a given tissue due to several informatics-related issues which impact probe identity. It should be pointed out that three recent papers also investi- 70 gated some of the problems for the HG-U133A, HG-U95A and HG-U133 Plus 2.0 GeneChips but no systematic solution was provided (7–9). For example, Harbig et al. re-annotated 37% of the probes on the HG-U133 Plus 2.0 array based on

*To whom correspondence should be addressed. Tel: +1 734 615 7099; Fax: +1 734 647 4130; Email: [email protected] The Author 2005. Published by Oxford University Press. All rights reserved. The online version of this article has been published under an open access model. Users are entitled to use, reproduce, disseminate, or display the open access version of this article for non-commercial purposes provided that: the original authorship is properly and fully attributed; the Journal and Oxford University Press are attributed as the original place of publication with the correct citation details given; if an article is subsequently reproduced or disseminated not in its entirety but only in part or as a derivative work this must be clearly indicated. For commercial re-use, please contact [email protected]

e175

Nucleic Acids Research, 2005, Vol. 33, No. 20

BLAST sequence match. They also envisioned several ways to automate the GeneChip probe annotation process and called help from the bioinformatics community. Here, we present a user-friendly solution compatible with all existing GeneChip 5 analysis software. The impact of the updated probe set definition on the interpretation of GeneChip data is also evaluated using a public domain data set. METHODS 10

Generation of reorganized probe sets based on the UniGene database

We want to describe the procedure for generating the UniGenebased probe set definitions first since the UniGene database is the most widely used gene classification system and most researchers will first map the GeneChip results to the UniGene 15 database in order to understand the biological significance of the GeneChip data. Generating UniGene-based probe sets is complicated because UniGene clusters often contain multiple sequences of unknown reliability and strand direction. The following are the steps in our UniGene probe set 20 re-organization process. These steps are applied in the order presented. (i) Perform sequence alignments. There are two components in this step: (a) map all GeneChip probe sequences to individual sequences in UniGene, dbSNP and the genome sequence 25 of the corresponding species. Only perfect matches are retained. (b) Align all sequences in the UniGene database to the most current genome assembly of the corresponding species. Since UniGene clusters can frequently contain sequences from other genes, we will use the genome alignment 30 results in later steps to provide some easily automatable cleanups. (ii) We require each probe in a probe set should only have one perfect match with the corresponding genome sequence. This may exclude probes that also match a non-transcribed 35 region in the genome, but this filter is not dependent on the completeness of genome annotation or cDNA/EST sequence collection and, therefore, should be more stable in the long term. The ubiquitous presence of non-coding transcripts also supports the use of this somewhat more aggressive strategy 40 (10). Given the fact that only 10% or less probes are eliminated by this criterion, we believe the slight drop in statistical power for each probe set is a worthwhile price for the gain in the confidence in the final results. (iii) Because EST sequences are subject to a relatively high 45 error rate, we require a probe to perfectly match a genomic region that can be aligned with mRNA/EST sequences collected in the UniGene database. Probes with perfect match only to EST sequences but not to the corresponding genomic sequence will not be included in the final probe sets. An 50 exception to this rule is exon–exon junction probes with perfect match to the mRNA reference sequences in the same UniGene cluster. We add such exon–exon junction probes back to the corresponding probe set and assign the lowest probe pair number(s). 55 (iv) In order to ensure that a probe is specific for one UniGene cluster, we eliminate probes with multiple matching cDNA/EST sequences that can be assigned to more than one UniGene cluster. Our previous requirement for genomic

PAGE 2

OF

9

sequence alignment can reduce false non-specific probes caused by erroneous EST sequencing or contaminating EST 60 sequences. Although this filter may remove good probes due to errors in UniGene clustering, it will guarantee that every probe set is consistent with current UniGene clustering. In theory, a probe with the potential to hybridize with transcripts from more than one gene is still useful if the unin- 65 tended transcripts are not expressed at a level that leads to significant probe signal interference in a given tissue or sample. However, while tissue- or sample-dependent probe set definitions can increase GeneChip probe utilization, the simplicity and the consistency of a tissue/sample-independent 70 probe set definition should be more advantageous in most situations. In the ideal situation, a gene-specific probe set should only contain probes whose sequence will be present on the shared sequences of all splicing products from the same gene, 75 as the signal level of such a probe set will not be influenced by the alternative splicing in different tissues or individuals. For most genes, current knowledge of potential alternative splicing products is far from complete. Thus, we choose to pool all probes targeting the same gene for the definition 80 of a gene-specific probe set. We believe the gene-based probe set definition is useful for evaluating the overall transcription activity of a gene, which is in fact claimed in most microarray research papers. Potential alternative splicing events can conceivably be explored by our transcript- or exon-specific 85 probe sets. Since some researchers prefer to examine probes targeting at the 30 end of transcripts, we also created probe sets that contain no more than 11 probe pairs at the most 30 end. (v) Except for genes with known mRNA/reference sequences, we require all probes in each probe set be aligned 90 in the same direction on the genome, as old probe sets representing the same gene can sometimes target different strands of the same transcript. This constraint ensures the directional homogeneity of newly defined probe sets during the merging of probes from multiple old probe sets. Probes with perfect 95 match to the same genomic region but in different directions are separated into two probe sets if there is no mRNA/reference sequence in the UniGene cluster for determining the transcription orientation on the genome. (vi) We also require that probes targeted to the same 100 UniGene cluster be aligned continuously on the genomic sequence in the same direction. For example, if probes representing a UniGene cluster are distributed across different genomic regions or chromosomes, the largest continuous set of probes will be used to represent this UniGene cluster. All 105 other probes intermingling with probes targeting different UniGene clusters will be omitted from the final probe set. An exemption to this rule is when an mRNA reference sequence in a UniGene cluster can be aligned to different genomic locations, as mRNA reference sequences are 110 probably more reliable than the current version of genome assembly. (vii) Each probe set should contain at least three probe pairs. Targets that cannot be represented by at least three probe pairs are eliminated in the final probe set definition. This threshold is 115 largely arbitrary, but a probe set with three probe pairs should satisfy the minimum requirements of most probe-level analysis algorithms. In our new UniGene probe set definitions, probe sets containing three or four probe pairs account for

PAGE 3

OF

9

cDNA > EST. If there is still a 50 tie, the shortest sequence is selected. There are a number of situations where the above procedure still leads to multiple sequences and we simply pick the accession number with the lowest alphabetic order as the designated best accession number for a probe set. 55 All procedures described above are implemented on a 4· dual opteron/8 GB memory cluster and a dual Itanium Oracle server with 16 GB of memory. We usually generate a new CDF build every 3–4 months and each build takes 40

Nucleic Acids Research, 2005, Vol. 33, No. 20

e175

10 days to finish on our current setup. A total of six custom CDF builds have been generated since early last year.

60

Use of custom CDF. Custom CDF files can be easily selected based on species, Affymetrix GeneChip type, CDF file type and CDF file format on our CDF download grid. The following are three examples that cover all common GeneChip probe-level analysis situations.

65

Example 1. Use custom CDF in Affymetrix MAS5 or standalone dCHIP: the ASCII format CDF files are for Affymetrix MAS5 and standalone dCHIP programs. After unzipping the ASCII CDF package, an ASCII format custom CDF file can be used exactly the same way as an Affymetrix 70 CDF file. Please note that the dCHIP program only accepts Affymetrix CDF names thus one has to change the name of a custom CDF file to the corresponding Affymetrix CDF file name. Example 2. Use custom CDF in R environment by calling 75 custom CDF packages directly on a computer with Internet link. The following is an example session: library(affy) data