BMC Genomics - CSIC Digital

3 downloads 0 Views 465KB Size Report
Apr 3, 2007 - Address: 1Neural Plasticity Department, Instituto de Neurobiología Ramón y Cajal (CSIC), Avda. Doctor Arce ... Sampedro - [email protected].
BMC Genomics

BioMed Central

Open Access

Research article

Cross-species analysis of gene expression in non-model mammals: reproducibility of hybridization on high density oligonucleotide microarrays Manuel Nieto-Díaz*1,2, Wolfgang Pita-Thomas1 and Manuel NietoSampedro1,2 Address: 1Neural Plasticity Department, Instituto de Neurobiología Ramón y Cajal (CSIC), Avda. Doctor Arce 37, 28002 Madrid, Spain and 2Experimental Neurology Unit, Hospital Nacional de Parapléjicos (SESCAM), Ctra. La Peraleda s/n, 45071 Toledo, Spain Email: Manuel Nieto-Díaz* - [email protected]; Wolfgang Pita-Thomas - [email protected]; Manuel NietoSampedro - [email protected] * Corresponding author

Published: 3 April 2007 BMC Genomics 2007, 8:89

doi:10.1186/1471-2164-8-89

Received: 5 September 2006 Accepted: 3 April 2007

This article is available from: http://www.biomedcentral.com/1471-2164/8/89 © 2007 Nieto-Díaz et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract Background: Gene expression profiles of non-model mammals may provide valuable data for biomedical and evolutionary studies. However, due to lack of sequence information of other species, DNA microarrays are currently restricted to humans and a few model species. This limitation may be overcome by using arrays developed for a given species to analyse gene expression in a related one, an approach known as "cross-species analysis". In spite of its potential usefulness, the accuracy and reproducibility of the gene expression measures obtained in this way are still open to doubt. The present study examines whether or not hybridization values from cross-species analyses are as reproducible as those from samespecies analyses when using Affymetrix oligonucleotide microarrays. Results: The reproducibility of the probe data obtained hybridizing deer, Old-World primates, and human RNA samples to Affymetrix human GeneChip® U133 Plus 2.0 was compared. The results show that crossspecies hybridization affected neither the distribution of the hybridization reproducibility among different categories, nor the reproducibility values of the individual probes. Our analyses also show that a 0.5% of the probes analysed in the U133 plus 2.0 GeneChip are significantly associated to un-reproducible hybridizations. Such probes-called in the text un-reproducible probe sequences- do not increase in number in cross-species analyses. Conclusion: Our study demonstrates that cross-species analyses do not significantly affect hybridization reproducibility of GeneChips, at least within the range of the mammal species analysed here. The differences in reproducibility between same-species and cross-species analyses observed in previous studies were probably caused by the analytical methods used to calculate the gene expression measures. Together with previous observations on the accuracy of GeneChips for cross-species analysis, our analyses demonstrate that cross-species hybridizations may provide useful gene expression data. However, the reproducibility and accuracy of these measures largely depends on the use of appropriated algorithms to derive the gene expression data from the probe data. Also, the identification of probes associated to un-reproducible hybridizations-useless for gene expression analyses- in the studied GeneChip, stress the need of a re-evaluation of the probes' performance.

Page 1 of 18 (page number not for citation purposes)

BMC Genomics 2007, 8:89

Background DNA microarray technology is a basic tool to measure genomewide changes in gene expression. Microarray analysis of gene expression in non-model mammals may provide very valuable data for biomedical [1-3] or evolutionary [4-7] studies. However, DNA microarrays are currently restricted to humans and a few model species, due to lack of sequence information for other species. This limitation could be overcome by using arrays developed for a given species to analyse gene expression in a related one [4,8-12]. This approach, known as "cross-species analysis", assumes that the RNA transcripts for one species will hybridize efficiently with the arrayed sequences of another species, provided that both species share enough sequence similarity (over 95% in orthologous 3'-UTR sequences according to Nagpal et al. [10]). The cross-species approach has been employed in several studies in mammals, using human microarrays to analyse closely related species, such as chimpanzees, orangutans and other primates [4], as well as more distantly related species, such as pigs, cows or dogs [8,9]. These studies assume that the short time of divergence between mammals (less that 100 million years) and the preservation of their protein function assures enough nucleotide-sequence conservation among species [9].

http://www.biomedcentral.com/1471-2164/8/89

this way is open to doubt. Two aspects of measurement quality appear as most important i) the accuracy of the measurement-the agreement between the observed and the true value of a measure; termed validity in statistical terminology-, and ii) the reproducibility (or precision; also called reliability in statistical jargon) of the measurement, i.e. whether repeated measurements will give similar values [14]. Different authors have examined both aspects of cross-species analyses using Affymetrix GeneChips, reaching diverse conclusions [9,10,15-17]. There is a general agreement in that the array sensitivity and, thus, the accuracy of the analysis, decreases with increasing sequence differences between the species being analyzed and the array species [9,15,16]. In a practical sense, this implies that cross-species analyses yield significantly more false negatives-genes that appear not to be expressed although they are really being expressed- than same-species analyses. This point was clearly illustrated by Chismar and co-workers [15] showing that the number of detected transcripts by a human GeneChip were a 50% lower when analysing Macaca samples than when analysing human samples. Accordingly, various authors have developed specific methods to correct the sensitivity reduction of cross-species analysis [9,10,16].

Among the existing DNA array platforms, Affymetrix high-density oligonucleotide GeneChips® (Affymetrix, Santa Clara, CA, USA) have been repeatedly employed for cross-species analyses. GeneChips estimate gene expression measures-like the presence and abundance of a transcript- by applying analytical methods to the hybridization values of sets of 11 to 20 pairs of probes (probesets) for each transcript [13]. Each probe pair consists of a 25 bases long perfect match probe (PM), fully complementary to the target, and a 25 bases long mismatch probe (MM), that shares only 24 bases with the target sequence. The large number of probes per target used by Affymetrix microarrays represents an advantage for cross species analyses with respect to other microarray platforms, such as those based on cDNA probes. The presence of 11 to 20 probes per target increases the probability of having probes with enough sequence similarity with the target transcript to obtain a feasible measure of its expression [9]. In contrast, the long sequence probes in cDNA microarrays may favour the hybridization with orthologous genes from other species compared to the 25 bases long Affymetrix probes. Genechips also have the advantage of allowing worldwide researchers to access the same standardized arrays, the same sample processing methods, and the same image acquisition instruments to quantify gene expression.

Data reproducibility has received less attention despite the fact that it is not possible to achieve accuracy in individual measurements if these measurements are associated to high variability. The percentage of transcripts that can be consistently detected as present (according to an Affymetrix algorithm [13]) across replicates is significantly reduced when Macaca RNA samples were hybridized to the Affymetrix Hu95Av5 human GeneChip, than when human RNA samples were hybridized to this chip [15]. A similar reduction in reproducibility of the present calls was observed by Wang et al. [17] when analysing Macaca and Pan samples with human GeneChips. In contrast, Dillman and Phillips [18] observed no reproducibility differences when the gene expression data generated by hybridizing human, Chlorocebus and Macaca RNA samples to the Affymetrix U133 plus 2.0 human GeneChip were compared, in agreement with previous observations from cross-species analyses using cDNA microarrays [8,19]. What is the real effect of cross-species analyses on the reproducibility of Affymetrix hybridization data? To explain these apparently contradictory results, we hypothesized that hybridization values were equally reproducible in cross-species and same-species analyses at least when restricted to mammals, and that contradictions arose from differences in the way gene expression measures were derived from the hybridization data of the Affymetrix probes.

Despite the potential usefulness of cross-species analyses, the quality of the gene expression measures obtained in

To test these hypotheses, we have compared the reproducibility of hybridization data from different mammal sam-

Page 2 of 18 (page number not for citation purposes)

BMC Genomics 2007, 8:89

ples analysed with the Affymetrix human GeneChip U133 Plus 2.0. Unlike previous studies that analysed reproducibility of gene expression measures like signal intensity or detection call, we have used probe intensity data. In this way, we strictly analysed hybridization reproducibility, leaving out the effect of the algorithms used to calculate the gene expression measures from the probe intensity values. We used specifically data from hybridization to GeneChip U133 Plus 2.0, to take advantage of the presence of probes with the same sequences in different probesets of this GeneChip. These internal replicas permit the study of reproducibility within the same chip, avoiding variations due to differences in the GeneChips or in their processing (including hybridization, staining and scanning), which may be responsible for much of the total variation in microarray analyses [20]. Probe intensity data originate from the analysis of human, three species of Old-World primates, and deer RNA samples. We compared data from such a differently related taxa to evaluate the effect of increasing sequence differences on hybridization reproducibility. Old World monkeys are closely related to humans (time of divergence 25–30 millions of years [21,22]) while deers (a representative of the Artiodactyla order) represent a more distantly related taxon, whose ancestors diverged from the ancestors of primates at the base of the Placental radiation, nearly 100 million of years ago [21,23]. Hybridization data from these species were used to test the following: 1. Whether cross-species hybridizations affected the distribution of the hybridization reproducibility, i.e., whether the number of sequences in different reproducibility categories was similar in same and cross-species analyses. 2. Whether there were probe sequences associated to irreproducible or poorly reproducible hybridizations, and whether cross-species analyses increased the proportion of these sequences with respect to same-species analyses. 3. Whether the reproducibility of each sequence tended to be lower in cross-species hybridizations than in same-species hybridizations. To further test the effect of sequence differences on hybridization reproducibility, we repeated these three analyses comparing the hybridization data for repeated perfect match (PM) and mismatch (MM) sequences in the human samples. These comparisons permit to evaluate the effect of a known and fixed sequence difference (one change in the 13th base) on hybridization reproducibility. Finally, we studied the relationship between hybridization value and reproducibility, to test whether low hybridization values were less reproducible than high values and thus, whether cross-species analyses were less reproduci-

http://www.biomedcentral.com/1471-2164/8/89

ble because they yielded lower hybridization signals than same-species analyses. The results presented show that, within the range of the mammals studied, cross-species analyses do not significantly affect hybridization reproducibility and suggest that the analytical methods used to calculate the gene expression measures are responsible for the previously observed reproducibility differences between same-species and cross-species analyses. In parallel, we have identified several probe sequences associated to poorly reproducible hybridizations that should not to be taken into consideration for quantifying gene expression.

Results Identification and characterization of the GeneChip repeated sequences Affymetrix U133 Plus 2.0 GeneChip contains 15472 sequences (7736 PM and 7736 MM) repeated in at least two different probes. These sequences correspond to 17462 of the 604258 probe pairs present in this GeneChip. Twenty six of the repeated sequences (13 PM and 13 MM), corresponding to 52 probes (26 PM and 26 MM), hybridize to non-human RNA transcripts included in the array as spikes to test the array performance. The number of repetitions ranges from 2 to 20, with most sequences repeated just once (Table 1). A description of all repeated probes in the U133 plus 2.0 GeneChip-including the probeset to which the probe belongs, the oligonucleotide sequence, and the position of the probe in the GeneChipis provided in the Additional file 1.

RNA samples used in all analyses are detailed in table 2. Hybridization values of each sample for all repeated sequences are provided in the Additional file 1. Normal distribution of these values was checked in 22 sequences with more than 6 repetitions, using Shapiro-Wilk W test (see Additional file 2). Hybridization values were normally distributed across replicas in most cases, though for some sequences differed significantly (p < 0.05) from the normal distribution, even after applying a Bonferroni correction. Non-normality in these cases was due to the presence of one or a few outlier probes with extreme values (see graphs in Additional file 2). No differences in normality were apparent between species or between PM and MM data. Kolmogorov Smirnoff and Lilliefors tests yielded comparable results (data not shown). Effect of sequence differences on the distribution of hybridization variability The coefficient of variation (CV) of the hybridization values for the repeated PM sequences in each sample for the various species was chosen as a measurement of hybridization reproducibility and used to analyse the hybridization reproducibility differences between same-species and

Page 3 of 18 (page number not for citation purposes)

BMC Genomics 2007, 8:89

http://www.biomedcentral.com/1471-2164/8/89

Table 1: Number of repeated probe pairs and sequences in the Affymetrix U133 Plus 2.0 GeneChip®.

Number of Repetitions

Number of Sequences

Number of Probe Pairs

2 3 4 5 6 7 8 9 11 12 15 16 20 Total

12794 1924 462 174 74 22 8 4 2 2 2 2 2 15472

12794 2886 924 435 222 77 32 18 11 12 15 16 20 17462

Both PM and MM sequences are considered to calculate the number of repeated sequences.

cross-species analyses. Although within-subject standard deviation is the most commonly used measurement of reproducibility, it was rejected due to its significant correlation with the mean in all samples (see data in Additional file 3 and analysis in Additional file 4). A brief discussion justifying the use of CV to measure data reproducibility is provided in the methods section (fully discussed in [14,24,25]). CV data for all samples and sequences may be obtained from the Additional file 3. A two-way ANOVA was used initially to test whether or not all species presented a similar number of sequences in each of the following classes of decreasing reproducibility: CV1 0.002

CV>0.75 0.003

CV>0.5 0.005

All probes NRR Seqs.

12

14

27

PM probes NRR Seqs.

1

9

8

CV>1

No associated 7721

PM probes

All Human Probes

MM probes

1

21

2

CV>0.75 PM probes

All Human Probes

MM probes

9

14

6

CV>0.5

No associated 7696

MM probes NRR Seqs.

2

6

5

No associated 7707

PM probes

All Human Probes

MM probes

8

27

5

Figure PM and 4 MM un-reproducible probe sequences (UPS) for human samples PM and MM un-reproducible probe sequences (UPS) for human samples. Results are detailed for MM and PM sequences, and for CV over 0.5, 0.75 and 1. All details as in figure 3.

ization values remains almost constant. This is perfectly illustrated by our results from the deer analyses. Sequence differences between the deer samples and the human microarray cause a strong decrease in the hybridization reflected in the fact that, after scaling, less than 10% of the genes arrayed in the U133 plus 2.0 GeneChip are detected as present (according to MAS 5.0 algorithm, data not shown) but, as shown by our results, they do not cause any significant change in the reproducibility of the hybridization values. Previous analysis of non-human primate and human gene expression with human GeneChips showed reproducibility differences between same-species and cross-species analyses [15,17]. These studies found that, although the number of genes that change their detection call (presence or absence of a transcript) across replicates (flux genes of Chismar et al. [15]) was similar, there was a clear difference in the distribution of the variability between samespecies and cross-species analyses. Flux genes were more associated to higher signal intensities in cross-species analysis than in same-species ones [15]. As a result the number of genes consistently detected across replicates showed a significant decrease in cross-species analyses (from 24 to 12% according to Wang et al. [17]) and the size of the flux gene set as a percentage of the genes called present became clearly higher in cross-species analyses than in same-species analyses (27% vs. 17% in the analysis of [15]). The apparent contradiction of these observations and our results is probably due to the differences in

the data employed. In the present study, we have used raw intensity values, whereas both Detection Call and Signal Intensity are gene expression measures that summarize the information on the probe level data after applying various analytical methods to the intensity values [27]. These analytical methods imply the use of several algorithms to normalize and quantify probe hybridization, followed by the application of other algorithms to summarize the information of the probes of each probeset into gene expression values. The analytical methods most commonly used, such as MAS 5.0 [28], dChip [29] or RMA (Robust Multi-chip Average [27]), differ in their approach and performance and seem to affect the reproducibility of the gene expression measures in both same-species and cross-species analyses [15,17]. In fact, the reproducibility of Detection Call and Signal Intensity appears to depend on the algorithm employed, with RMA and dChip algorithms yielding more reproducible results than the Affymetrix MAS5 [17]. Our study demonstrates, in this respect, that the reproducibility differences of gene expression measures seen in some previous cross-species analyses [15,17] were not due to differences in hybridization reproducibility. Rather, they were probably caused by the processing of the hybridization data, that is, by the algorithms used to estimate the presence and abundance of the transcripts. This conclusion may also explain why cDNA microarrays do not show any change in the reproducibility of their gene expression measures when performing cross-species analyses [8,19], while Affymetrix oligonucleotide microarrays do. Since cDNA microarrays

Page 10 of 18 (page number not for citation purposes)

BMC Genomics 2007, 8:89

http://www.biomedcentral.com/1471-2164/8/89

90% values

50% values mean

Human Samples Data HSAP1 HSAP2 HSAP3 HSAP4 HSAP5 -0.08

-0.06

-0.04

-0.02

0

0.02

0.04

0.06

0.08

0.10

0.04

0.06

0.08

0.10

Species Mean Data DEER CAET MFAS MMUL -0.08

-0.06

-0.04

-0.02

0

0.02

Relative Hybridization Variability (CVspecies j - mean CVhuman)

Mean SD t=mean/(SD/sqrt(N))

DEER

CAET

MFAS

MMUL

-0.0104 0.0599

-0.0073 0.0508

-0.0115 0.0526

-0.0050 0.0501

15.2422

12.6075

19.2382

8.8502

sqrt(N)=87.95453371 D.F.=7735

p (one-tailed)*

A