R routines for partial mixture estimation and differential expression analysis
David Rossell∗
1 Introduction
The following R routines are provided in the file ebayes.l2e.r (available at http://rosselldavid.googlepages.com).
ebayes.l2e: core routine that performs two-group differential expression analysis via partial mixture estimation. Additionally, it implements pairwise t and Wilcoxon tests with Benjamini & Hochberg’s p-value adjustment.
qqnorm.pme: creates qq-normal plot to assess the partial mixture assumption that the equally expressed genes are normally distributed.
ma.boxplot: creates MA-plot to assess the assumption that the differences between group means are identically distributed.
find.fdr: auxiliary routine that, given the posterior probability that each gene is differentially expressed, computes the Bayesian FDR [1].
find.threshold: auxiliary routine that, given the posterior probability that each gene is differentially expressed, computes the optimal threshold to declare significance while controlling the Bayesian FDR [2].
wupdc: fits a partial Normal mixture component via weighted L2E distance minimization.
rewupdc: fits a partial Normal mixture component via iteratively reweighted L2E distance minimization.
∗ Department of Biostatistics, The University of Texas M. D. Anderson Cancer Center, Houston, TX.
wupdc.findw: fits a partial Normal or partial T component via weighted L2E minimization, fixing the parameters of the component so that only its weight (i.e., the proportion of differentially expressed genes) needs to be estimated.
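To make the role of find.fdr and find.threshold more concrete, the following is a minimal sketch of the standard Bayesian FDR calculation: the expected proportion of false positives among the genes declared significant, given their posterior probabilities of differential expression. The function names bayes_fdr and bayes_threshold and the toy vector pp are hypothetical and chosen for illustration; this is not the code shipped in ebayes.l2e.r, whose arguments and return values may differ.

```r
# Hypothetical helpers for illustration only; not the routines in ebayes.l2e.r.

# Bayesian FDR of the rule "declare a gene DE if its posterior probability
# is at least `threshold`": average posterior probability of being a false
# positive among the genes that are called.
bayes_fdr <- function(pp, threshold) {
  called <- pp >= threshold
  if (!any(called)) return(0)
  sum(1 - pp[called]) / sum(called)
}

# Smallest threshold (i.e. largest list of genes) whose Bayesian FDR stays
# at or below `fdr`. Ties at the threshold are not treated specially here.
bayes_threshold <- function(pp, fdr = 0.05) {
  sorted <- sort(pp, decreasing = TRUE)
  running_fdr <- cumsum(1 - sorted) / seq_along(sorted)
  ok <- which(running_fdr <= fdr)
  if (length(ok) == 0) return(Inf)  # no gene can be called at this FDR level
  sorted[max(ok)]
}

# Toy posterior probabilities (made up for the example)
pp <- c(0.99, 0.95, 0.90, 0.60, 0.20, 0.05)
thr <- bayes_threshold(pp, fdr = 0.05)
fdr_at_thr <- bayes_fdr(pp, thr)
```

In this toy example only the two genes with the highest posterior probabilities can be called while keeping the Bayesian FDR below 0.05; adding the third gene would raise it above that level.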
The routines in ebayes.l2e.r only implement the weighted L2E criterion for partial mixture estimation. For them to work correctly, one should also download and source the routines that implement the non-weighted L2E criterion. These can be found in the file mpdc.r, which is available at David W. Scott’s page (http://www.stat.rice.edu/~scottdw/code/l2e).
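For readers unfamiliar with the non-weighted criterion, the sketch below fits a single partial Normal component w * N(mu, sigma^2) by minimizing the L2E criterion w^2 / (2 sigma sqrt(pi)) - (2 w / n) sum_i phi(x_i | mu, sigma). It is an illustration written from that formula with a generic optimizer; it is not the code in mpdc.r, and it does not implement the weighted version used by wupdc. The function name l2e_partial, the starting values and the simulated data are assumptions made for the example.

```r
# Illustrative sketch of the non-weighted L2E criterion for one partial
# Normal component; not the implementation in mpdc.r or ebayes.l2e.r.
l2e_partial <- function(par, x) {
  mu    <- par[1]
  sigma <- exp(par[2])   # optimize log(sigma) so that sigma stays positive
  w     <- par[3]        # weight of the component (left unconstrained here)
  w^2 / (2 * sigma * sqrt(pi)) - 2 * w * mean(dnorm(x, mu, sigma))
}

# Toy data: 90% N(0, 1) observations plus 10% shifted observations
set.seed(1)
x <- c(rnorm(900), rnorm(100, mean = 5))

# Robust starting values; Nelder-Mead is the default method of optim()
fit <- optim(c(median(x), log(mad(x)), 1), l2e_partial, x = x)
est <- c(mu = fit$par[1], sigma = exp(fit$par[2]), w = fit$par[3])
```

With data like these, the fitted component should sit close to the N(0, 1) bulk, with an estimated weight near 0.9; the remaining mass is what a partial mixture leaves unmodelled.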
2 Partial mixture estimation: estimating the null distribution
Let’s start by simulating expression data for 10,000 genes (we set the seed of the random number generator so that the results can be reproduced exactly). We define mean expression values a, ranging from 5 to 10, and we define the differences in group means m to be a decreasing function of a. We then draw 2 observations for each group, with the variance also being a decreasing function of a. We have observed this sort of decrease in mean differences and variances in several real datasets, which is why we set up the simulation this way. Finally, we randomly set approximately 5% of the genes to be differentially expressed by adding 1/4 to the expression values in the first group.
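A simulation along the lines just described could be sketched as follows. Only the number of genes, the range of a, the two observations per group, the approximate 5% proportion and the 1/4 shift come from the description above; the particular decreasing functions used for m and for the standard deviation, and the object names x, y and de, are assumptions made for illustration.

```r
# Sketch of the simulation design; the functional forms of m and s are assumed.
set.seed(1)
n <- 10000                          # number of genes
a <- seq(5, 10, length.out = n)     # mean expression value of each gene
m <- 1 / a                          # assumed decreasing difference in group means
s <- 1 / a                          # assumed decreasing standard deviation

# Two observations per gene in each group (one column per observation);
# the mean and sd vectors are recycled across the two columns.
x <- matrix(rnorm(2 * n, mean = a + m / 2, sd = s), ncol = 2)
y <- matrix(rnorm(2 * n, mean = a - m / 2, sd = s), ncol = 2)

# Flag roughly 5% of the genes as differentially expressed and shift
# their expression values in the first group by 1/4.
de <- runif(n) < 0.05
x[de, ] <- x[de, ] + 0.25
```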
> source("~/projects/l2e/1 R Routines/ebayes.l2e.R")
> source("~/projects/l2e/1 R Routines/mpdc.R")
> set.seed(1)
> n [...]
> sam.x [...]
> qqt(tstat, df = nu)
> plot(ebayes.fit$tstat, ebayes.fit$wl2e.pde, xlab = "Test statistic",
+     ylab = "Posterior prob. of DE")
References
[1] C. Genovese and L. Wasserman. Operating characteristics and extensions of the false discovery rate procedure. Journal of the Royal Statistical Society B, 64:499–518, 2002.
[2] P. Müller, G. Parmigiani, C. Robert, and J. Rousseau. Optimal sample size for multiple testing: the case of gene expression microarrays. Journal of the American Statistical Association, 99:990–1001, 2004.
[3] G. K. Smyth. Linear models and empirical Bayes methods for assessing differential expression in microarray experiments. Statistical Applications in Genetics and Molecular Biology, 3, 2004.
[4] G. K. Smyth. Limma: linear models for microarray data. In R. Gentleman, V. Carey, S. Dudoit, R. Irizarry, and W. Huber, editors, Bioinformatics and Computational Biology Solutions using R and Bioconductor, pages 397–420. Springer, New York, 2005.