RefPlus: an R package extending the RMA Algorithm

Bioinformatics Advance Access published July 10, 2007

Gene Expression

Chris Harbron1,*, Kai-Ming Chang2 and Marie C. South3

1 Statistical Sciences, AstraZeneca, Alderley Park, Macclesfield, Cheshire SK10 4TG, UK.
2 Department of Research, Koo Foundation Sun Yat-Sen Cancer Center, 125 Lihder Road, Taipei 112 Taiwan.
3 Cancer Discovery Medicine, AstraZeneca, Alderley Park, Macclesfield, Cheshire SK10 4TG, UK.

Associate Editor: Prof. David Rocke

* To whom correspondence should be addressed.

© The Author (2007). Published by Oxford University Press. All rights reserved. For Permissions, please email: [email protected]

Downloaded from http://bioinformatics.oxfordjournals.org/ by guest on June 13, 2013

ABSTRACT

Summary: RMA has become a widely used methodology to preprocess Affymetrix gene expression microarrays. A limitation of RMA is that the calculated probeset intensities change when a set of microarrays is re-pre-processed after the inclusion of additional microarrays into the analysis set. Here we report the availability of the RefPlus package containing functions to perform the Extrapolation Strategy and Extrapolation Averaging algorithms which address these issues.

Availability: The software is implemented in the R language and can be downloaded from the Bioconductor project (http://www.bioconductor.org).

Contact: [email protected]

Supplementary information: Further details of the workings and evaluation of these functions are given in the documentation available on the Bioconductor website.

1 INTRODUCTION

It is often necessary to analyze microarray data at one or more interim stages throughout the course of a study. Multiple-microarray pre-processing algorithms for Affymetrix microarrays such as RMA (Irizarry et al., 2003) have the undesirable property that the probeset intensities change when microarrays are re-preprocessed due to the inclusion of additional microarrays. A similar situation can occur when developing and applying prediction or classification models using microarrays. Any new sample that is to be predicted by the model will need to be pre-processed and preprocessing this sample along with the training set of samples used to develop the model will change the probeset intensities of these microarrays and the parameters of the fitted model.

An extension to RMA, the Extrapolation Strategy, provides a solution to these problems. This method was independently developed by Goldstein (2006) and also by Katz et al. (2006) as refRMA. It avoids having to re-pre-process already pre-processed microarrays when new arrays are added to the data set, but still maintains many of the desirable properties of RMA. RMA is applied to a reference set of microarrays, storing the parameters of the RMA fit. To process additional microarrays, these parameters are directly applied, without any re-estimation, to the new microarrays leaving the gene expression measurements of the reference microarrays unchanged. A similar strategy has also been considered in the PLIER algorithm (Affymetrix, 2005), where a model file fitted by a set of microarrays can be stored and used later.

The use of the RMA algorithm for processing large numbers of microarrays can be limited by available computer memory. One approach is to apply the Extrapolation Strategy, using a subset of microarrays as the reference set and processing the remaining microarrays using the parameters calculated from this reference set. Alternatively the Extrapolation Averaging (Goldstein, 2006) algorithm gives an improved approximation to RMA by averaging multiple Extrapolation Strategy results over different reference sets.

2 ALGORITHMS

2.1 RMA

RMA consists of three steps:

(1) Background Correction: probe-level data for each microarray are background corrected independently using a probabilistic model.

(2) Quantile Normalisation: the background corrected probe-level data on each microarray are normalized to a common set of quantiles, derived from background corrected data from all microarrays.

(3) Expression Calculation: estimated separately for each probeset using median polish on the linear model:

$$\log_2(N_{ij}) = P_j + I_i + \varepsilon_{ij} \qquad (1)$$

where $I_i$ is the logarithmic intensity for the $i$th microarray, $N_{ij}$ is the background corrected and normalised intensity of the $j$th probe of the $i$th microarray, $P_j$ is the effect of the $j$th probe in the probeset and $\varepsilon_{ij}$ is an error term. For further details on the RMA algorithm refer to Irizarry et al. (2003).

2.2 Extrapolation Strategy

The Extrapolation Strategy divides the set of microarrays into two distinct sets: the reference set used to generate reference sets of parameters for future processing and the future set of all other microarrays which are subsequently processed. The Extrapolation Strategy consists of four steps:

(1) RMA: RMA is applied to the reference set to obtain the probeset intensities of the reference set microarrays. The reference quantiles and reference probe effects are stored.

(2) Background Correction: as in RMA, applied to the future set.

(3) Normalisation: the background corrected probe level data from the future microarrays are quantile normalised to the reference quantiles.

(4) Expression Calculation: the probeset intensities of the future microarrays are estimated using the model (1) assuming that the probe effects of the future microarrays are the same as the probe effects of the reference set. The estimated logarithmic intensity $\tilde{I}_f$ of a probeset on a future array is:

$$\tilde{I}_f = \operatorname*{median}_{j \in \mathrm{Probes}} \left( \log_2 N_{fj} - P_j \right) \qquad (2)$$
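The reference fit of model (1) and steps (3) and (4) of the Extrapolation Strategy can be illustrated with a short Python sketch. It is not the RefPlus implementation: all function names are hypothetical, background correction is omitted, ties in the quantile normalisation are broken by position, and the median polish is a simplified fixed-iteration variant that centres the probe effects at a median of zero. Inputs are plain lists, with one row per microarray and one column per probe.

```python
from statistics import mean, median


def reference_quantiles(ref_arrays):
    """Mean of the sorted intensities across the reference arrays, one value per rank."""
    return [mean(rank_values) for rank_values in zip(*(sorted(a) for a in ref_arrays))]


def normalise_to_reference(array, ref_quantiles):
    """Step (3): replace each intensity by the reference quantile of the same rank."""
    order = sorted(range(len(array)), key=lambda k: array[k])
    out = [0.0] * len(array)
    for rank, k in enumerate(order):
        out[k] = ref_quantiles[rank]
    return out


def median_polish(log_n, n_iter=10):
    """Fit model (1) for one probeset: log_n[i][j] ~ I[i] + P[j].

    Simplified sketch: alternately sweep row (array) and column (probe)
    medians out of the residuals, then shift so that median(P) is zero.
    """
    resid = [list(row) for row in log_n]
    n_arrays, n_probes = len(resid), len(resid[0])
    array_eff = [0.0] * n_arrays
    probe_eff = [0.0] * n_probes
    for _ in range(n_iter):
        for i in range(n_arrays):
            m = median(resid[i])
            array_eff[i] += m
            resid[i] = [r - m for r in resid[i]]
        for j in range(n_probes):
            m = median(row[j] for row in resid)
            probe_eff[j] += m
            for row in resid:
                row[j] -= m
    shift = median(probe_eff)
    probe_eff = [p - shift for p in probe_eff]
    array_eff = [a + shift for a in array_eff]
    return array_eff, probe_eff


def extrapolate(log_n_future, probe_effects):
    """Step (4), equation (2): median over probes of (log2 N_fj - P_j)."""
    return median(x - p for x, p in zip(log_n_future, probe_effects))
```

Because `extrapolate` uses only the future array's own probe values and the stored probe effects, adding further arrays never changes values that have already been computed. The median in equation (2) also keeps the estimate stable when a single probe on the future array is an outlier.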

Figure 1 compares the relationships between variables for RMA and the Extrapolation Strategy.

[Figure 1: two diagrams titled "RMA" and "Extrapolation Strategy", the latter divided into "Reference Set" and "Future Set". Node labels: Raw Probe Intensities, Background Corrected Probe Intensities, Reference Quantiles, Normalised Probe Intensities, Reference Probe Effects, Probeset Intensities.]





2.3 Extrapolation Averaging

Extrapolation Averaging consists of repeated application of the Extrapolation Strategy using different reference sets and can be described in four steps:

(1) Randomly select n microarrays as a reference set; the remainder of the microarrays form the future set. n is the maximum number of microarrays that can be processed in one batch by RMA within the available computer memory.

(2) Apply the Extrapolation Strategy to this reference and future set.

(3) Repeat steps 1 and 2 several times.

(4) Calculate the probeset intensities as an average on the log2 scale of the gene expression profiles calculated in steps 1-3.

Any additional microarrays can be pre-processed by using the Extrapolation Strategy to calculate a gene expression profile based on the saved parameters from all of the reference sets and averaging these gene expression profiles across the reference sets.
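The averaging procedure can be sketched in Python for a single probeset as follows. This is an illustration under loudly simplifying assumptions, not the RefPlus code: the function names are hypothetical, the inputs are assumed to be already background corrected, normalised and on the log2 scale, and the RMA fit of the reference set is replaced by a toy stand-in (centred per-probe medians). Reference arrays are also summarised with equation (2) here, rather than by median polish.

```python
import random
from statistics import mean, median


def fit_probe_effects(reference):
    """Toy stand-in for the RMA fit: per-probe medians over the reference
    arrays, centred so that the probe effects have median zero."""
    col_medians = [median(col) for col in zip(*reference)]
    centre = median(col_medians)
    return [m - centre for m in col_medians]


def summarise(array, probe_effects):
    """Equation (2): median over probes of (log2 intensity - probe effect)."""
    return median(x - p for x, p in zip(array, probe_effects))


def extrapolation_averaging(arrays, n, n_repeats, seed=0):
    """Average per-array estimates over n_repeats random reference sets of size n."""
    rng = random.Random(seed)
    indices = list(range(len(arrays)))
    runs = []          # one list of per-array estimates per repetition
    saved_params = []  # probe effects of every reference set, kept for later arrays
    for _ in range(n_repeats):                      # step (3): repeat
        ref_idx = rng.sample(indices, n)            # step (1): random reference set
        probe_effects = fit_probe_effects([arrays[i] for i in ref_idx])
        runs.append([summarise(a, probe_effects) for a in arrays])  # step (2)
        saved_params.append(probe_effects)
    averaged = [mean(values) for values in zip(*runs)]  # step (4): average on log2 scale
    return averaged, saved_params


def process_additional(array, saved_params):
    """Pre-process a new array by averaging over all saved reference fits."""
    return mean(summarise(array, p) for p in saved_params)
```

Keeping `saved_params` is what allows later arrays to be processed without touching earlier results: each new array is summarised once per stored reference fit and the results are averaged.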

3 CONCLUSIONS

The RMA algorithm has been found to have good performance characteristics in the pre-processing of Affymetrix gene expression data (Irizarry et al., 2006). A limitation of RMA is that the probeset intensities change when the analysis set of microarrays changes. This can be an issue when a study is analysed at interim stages, as the data for the same samples will vary between analyses. This property also makes the application of predictive models difficult, as additional microarrays need to be pre-processed to apply the model, but without changing the model parameters. For large sets of microarrays, available computer memory can also limit the use of RMA.

The Extrapolation Strategy and Extrapolation Averaging algorithms implemented in the R package RefPlus provide an easily applied solution to these issues. An evaluation using the data from Bhattacharjee et al. (2001), showing that the Extrapolation Strategy and Extrapolation Averaging algorithms provide a close approximation to RMA even in challenging situations, can be found along with the R package on the Bioconductor website.

Fig. 1. Graphical representations of the RMA and Extrapolation Strategy algorithms. In both RMA and the Extrapolation Strategy the calculated probeset intensities are dependent on both the normalising quantiles and the probe effects and, given these terms, the calculated probeset intensities of all microarrays are conditionally independent of each other. In RMA these terms, and so the calculated probeset intensities, depend on all of the other microarrays. In the Extrapolation Strategy the calculated probeset intensities depend only on the probe intensities from that microarray and on the reference quantiles and reference probe effects, which are calculated only from the reference set of microarrays.

ACKNOWLEDGEMENTS


The authors would like to acknowledge colleagues within AstraZeneca who provided valuable suggestions and comments, and thank the authors of Bhattacharjee et al. (2001) who permitted the use of their microarray data.

REFERENCES

Affymetrix. (2005) Guide to Probe Logarithmic Intensity Error (PLIER) Estimation.
Bhattacharjee A, Richards WG, Staunton J, Li C, Monti S, Vasa P, Ladd C, Beheshti J, Bueno R, Gillette M et al. (2001) Classification of human lung carcinomas by mRNA expression profiling reveals distinct adenocarcinoma subclasses. PNAS, 98, 13790-13795.
Goldstein DR. (2006) Partition resampling and extrapolation averaging: approximation methods for quantifying gene expression in large numbers of short oligonucleotide arrays. Bioinformatics, 22, 2364-2372.
Irizarry RA, Hobbs B, Collin F, Beazer-Barclay YD, Antonellis KJ, Scherf U, Speed TP. (2003) Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics, 4, 249-264.
Irizarry RA, Wu Z, Jaffee HA. (2006) Comparison of Affymetrix GeneChip expression measures. Bioinformatics, 22, 789-794.
Katz S, Irizarry R, Lin X, Tripputi M, Porter M. (2006) A summarization approach for Affymetrix GeneChip data using a reference training set from a large, biologically diverse database. BMC Bioinformatics, 7, 464.

