Asymmetric Laplace Distribution for Gene Expression Data
Introduction
Possible Applications
The distribution of normalized gene expression values, while similar across arrays, is often far from normal regardless of the normalization method. Rather, the distribution tends toward heavy tails and varying degrees of asymmetry. The Asymmetric Laplace distribution (Kotz et al., 2001) generalizes the heavy-tailed Laplace (double exponential) distribution to allow for asymmetry, and it gives a good fit for gene expression data. Having a reasonable parametric model for the distribution within arrays allows a more detailed examination of methods of analysis.
Non-parametric methods allow a great deal of flexibility in analysis. But ultimately, having a good parametric model gives the advantage of being able to systematically explore the theoretical behavior of random variables.
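Asymmetric Laplace Distribution (AL(θ, µ, σ))

$$
f(x) = \frac{\sqrt{2}}{\sigma}\,\frac{\kappa}{1+\kappa^{2}}
\begin{cases}
\exp\left(-\dfrac{\sqrt{2}\,\kappa}{\sigma}\,|x-\theta|\right) & \text{if } x \ge \theta \\[2ex]
\exp\left(-\dfrac{\sqrt{2}}{\sigma\kappa}\,|x-\theta|\right) & \text{if } x < \theta
\end{cases}
$$

where σ > 0, θ ∈ R, κ > 0, µ ∈ R

θ ⇒ location
σ ⇒ scale
µ, κ ⇒ skewness, with $\mu = \sigma\left(\frac{1}{\kappa} - \kappa\right)/\sqrt{2}$

$$
E(Y) = \theta + \mu, \qquad \operatorname{var}(Y) = \sigma^{2} + \mu^{2}, \qquad \frac{Y-\theta}{\sigma} \sim AL(0, \kappa, 1)
$$

$$
\hat{\theta} = \arg\min_{\theta}\left\{ \frac{1}{n}\sum_{i} |X_i - \theta| + 2\sqrt{\alpha(\theta)\,\beta(\theta)} \right\}
$$

$$
\hat{\mu} = \bar{X} - \hat{\theta}
$$

$$
\hat{\sigma} = \sqrt{2}\,\sqrt[4]{\alpha(\hat{\theta})\,\beta(\hat{\theta})}\left(\sqrt{\alpha(\hat{\theta})} + \sqrt{\beta(\hat{\theta})}\right)
$$

$$
\text{where}\quad \alpha(\theta) = \frac{1}{n}\sum_{i}(X_i-\theta)^{+}, \qquad \beta(\theta) = \frac{1}{n}\sum_{i}(X_i-\theta)^{-}
$$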
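The estimators for θ, µ and σ can be computed directly from a sample. The following is a minimal Python sketch, not the authors' implementation; the function names `sample_al` and `al_mle` are hypothetical. It assumes the minimum over θ is attained at one of the observed values, so only the data points are searched. The test data come from the representation of an AL variable as a scaled difference of two exponential variables, which is an assumption brought in for the check.

```python
import math
import random


def sample_al(theta, kappa, sigma, rng):
    """One AL draw as a scaled difference of two Exp(1) variables (assumed representation)."""
    e1 = rng.expovariate(1.0)
    e2 = rng.expovariate(1.0)
    return theta + sigma / math.sqrt(2) * (e1 / kappa - kappa * e2)


def al_mle(xs):
    """Return (theta_hat, mu_hat, sigma_hat), searching theta over the observed values."""
    xs = sorted(xs)
    n = len(xs)
    total = sum(xs)
    below = 0.0  # running sum of the observations smaller than the candidate
    best = None
    for k, t in enumerate(xs):
        # alpha: mean positive part of (x - t); beta: mean negative part of (x - t)
        alpha = max(((total - below) - (n - k) * t) / n, 0.0)
        beta = max((k * t - below) / n, 0.0)
        # alpha + beta equals the mean absolute deviation about t
        h = alpha + beta + 2.0 * math.sqrt(alpha * beta)
        if best is None or h < best[0]:
            best = (h, t, alpha, beta)
        below += t
    _, theta, alpha, beta = best
    mu = total / n - theta
    sigma = math.sqrt(2) * (alpha * beta) ** 0.25 * (math.sqrt(alpha) + math.sqrt(beta))
    return theta, mu, sigma


# Illustrative check on simulated data (theta = 0.5, kappa = 1.2, sigma = 1).
rng = random.Random(1)
data = [sample_al(0.5, 1.2, 1.0, rng) for _ in range(20000)]
theta_hat, mu_hat, sigma_hat = al_mle(data)
print(theta_hat, mu_hat, sigma_hat)
```

Sorting once and keeping a running sum makes the search linear in the sample size after the sort, which matters for arrays with many thousands of genes.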
Shown below is the AL(θ, µ, σ) fit to data from Xu et al. (2003), normalized with vsn (Huber and Heydebreck, 2003; Durbin et al., 2002), and to data from Yang et al. (2002), self-self hybridizations normalized using loess smoothing. For arrays not shown, the fit is often comparable to those shown here. In particular, the AL(θ, µ, σ) distribution improves upon the fit of the Normal in almost all of the arrays, and otherwise fits only slightly worse than the Normal.
[Figure: Q-Q Plots. Panels: AL(θ, µ, σ), Normal; data: T-Cell. Legend: Asym. Laplace, Normal, Spline Fit.]
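2. Log-ratio of two independent Pareto I random variables,

$$
Y_i \overset{d}{=} \theta + \frac{\sigma}{\sqrt{2}} \log\left(\frac{P_1}{P_2}\right), \quad \text{where } P_1 \sim \text{Pareto I}(\kappa, 1),\; P_2 \sim \text{Pareto I}(1/\kappa, 1) \tag{2}
$$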
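The log-ratio representation can be checked by simulation against the stated moments E(Y) = θ + µ and var(Y) = σ² + µ². This is a rough sketch with hypothetical function names. It assumes that in Pareto I(a, 1) the first argument is the shape and the second the scale, so that P(P > x) = x^(−a) for x ≥ 1.

```python
import math
import random


def sample_pareto1(shape, rng):
    """Pareto I(shape, 1) by inverse CDF: P(P > x) = x**(-shape) for x >= 1."""
    u = 1.0 - rng.random()  # uniform on (0, 1], avoids a zero base
    return u ** (-1.0 / shape)


def sample_al_log_ratio(theta, kappa, sigma, rng):
    """theta + sigma / sqrt(2) * log(P1 / P2) with independent Pareto I variables."""
    p1 = sample_pareto1(kappa, rng)
    p2 = sample_pareto1(1.0 / kappa, rng)
    return theta + sigma / math.sqrt(2) * math.log(p1 / p2)


# Illustrative parameter values, not taken from the data.
theta, kappa, sigma = 0.0, 1.2, 1.0
mu = sigma * (1.0 / kappa - kappa) / math.sqrt(2)

rng = random.Random(2)
ys = [sample_al_log_ratio(theta, kappa, sigma, rng) for _ in range(100000)]
mean = sum(ys) / len(ys)
var = sum((y - mean) ** 2 for y in ys) / len(ys)

print("sample mean", mean, "vs theta + mu", theta + mu)
print("sample var ", var, "vs sigma^2 + mu^2", sigma ** 2 + mu ** 2)
```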
• cDNA array measurements are often log-ratios of the red and green channels (or approximately so under the VSN normalization).
• This implies that the skewness (the κ term) arises from a difference in distribution between the red and green channels.
• A "Pareto-like" distribution has been found for mRNA expression (Kuznetsov, 2001; Wu et al., 2003).
• In the datasets we examined, the separate green and red channels did not seem to follow a Pareto distribution.
• Independence between the red and green channels is unlikely.
Many intuitive notions of correlation implicitly depend on normal assumptions, but these basic assumptions do not always hold for other distributions. For example, if X, Y ∼ F, the minimum attainable correlation ρmin is not necessarily −1 unless F is symmetric (de Veaux, 1976), where ρmin = min corr(X, Y) over all possible bivariate distributions with marginals F.
X, Y with AL(θ, µ, σ) marginals
• The Bivariate Asymmetric Laplace distribution given in Kotz et al. (2001) has minimum correlation $\rho_{BAL} = \dfrac{\mu^{2} - 1}{\mu^{2} + 1}$
• Asymptotically, ρmin → 1 − π²/6 ≈ −0.645 and ρBAL → +1
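The value of ρmin for AL marginals can be approximated numerically. The sketch below relies on a standard result not spelled out here: among all joint distributions with given marginals, the correlation is smallest under the countermonotone coupling X = Q(U), Y = Q(1 − U), where Q is the quantile function and U is uniform. The quantile function is derived from the AL density, and the function names are hypothetical.

```python
import math


def al_quantile(u, theta, kappa, sigma):
    """Quantile function of the AL distribution in its kappa parameterization."""
    p0 = kappa ** 2 / (1.0 + kappa ** 2)  # probability mass below theta
    if u < p0:
        return theta + sigma * kappa / math.sqrt(2) * math.log(u / p0)
    return theta - sigma / (math.sqrt(2) * kappa) * math.log((1.0 - u) * (1.0 + kappa ** 2))


def min_correlation(kappa, n=50000):
    """Approximate corr(Q(U), Q(1 - U)) on a midpoint grid of n values of U."""
    xs = [al_quantile((i + 0.5) / n, 0.0, kappa, 1.0) for i in range(n)]
    ys = xs[::-1]  # Q(1 - u_i) is Q evaluated at the mirrored grid point
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / n
    cov = sum((x - mean) * (y - mean) for x, y in zip(xs, ys)) / n
    return cov / var  # both marginals share the same variance


for kappa in (1.0, 1.2, 1.5, 0.01):
    print(kappa, min_correlation(kappa))
```

Correlation is invariant to location and scale, so θ = 0 and σ = 1 are used without loss of generality. A very small κ stands in for the limiting, strongly skewed case.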
However, in general, correlation is not guaranteed to be on the same scale across different distributions, and thus clustering with it could be problematic. More complicated situations than those examined here, e.g. data with dependency, could be worse.
Maximum Likelihood Estimates of the parameters for microarray data
• The t_{n−1} distribution will be conservative for the t-statistic if the data are symmetric Laplace, L(θ, σ) (Lehmann, 1986; Sansing, 1976). Simulations show that the t-statistic follows the t_{n−1} distribution well, even for n = 10.
• For AL(θ, µ, σ), if µ is large relative to σ, then for small sample sizes the t-statistic will be skewed.
• For the small κ seen in microarrays, the t-statistic still follows a t_{n−1} distribution fairly closely (simulations).
• For the AL(θ, µ, σ) distribution, the mean is µ + θ, not just θ. If the parameter of interest is θ (the location parameter), then the t-statistic is affected by the skewness of the data. For a 2-sample t-statistic, if κ is the same in both samples, then the t-statistic will still be centered at 0 under the null.
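A simulation of this kind can be sketched as follows. This is not the authors' simulation code; the function names, seed and replication count are arbitrary choices. The sketch draws samples of size n = 10, tests the true mean θ + µ, and reports how often |t| exceeds 2.262, the tabulated two-sided 5% critical value of the t distribution with 9 degrees of freedom.

```python
import math
import random

N = 10            # sample size per replicate
T_CRIT_9 = 2.262  # two-sided 5% critical value for t with N - 1 = 9 degrees of freedom


def sample_al(theta, kappa, sigma, rng):
    """One AL draw as a scaled difference of two Exp(1) variables (assumed representation)."""
    e1 = rng.expovariate(1.0)
    e2 = rng.expovariate(1.0)
    return theta + sigma / math.sqrt(2) * (e1 / kappa - kappa * e2)


def t_statistic(xs, center):
    """One-sample t-statistic of xs against the hypothesized mean `center`."""
    n = len(xs)
    mean = sum(xs) / n
    s2 = sum((x - mean) ** 2 for x in xs) / (n - 1)
    return (mean - center) / math.sqrt(s2 / n)


def rejection_rate(kappa, reps=20000, seed=3):
    """Share of replicates with |t| > T_CRIT_9 when the true mean is tested."""
    rng = random.Random(seed)
    true_mean = (1.0 / kappa - kappa) / math.sqrt(2)  # theta = 0, sigma = 1
    hits = 0
    for _ in range(reps):
        xs = [sample_al(0.0, kappa, 1.0, rng) for _ in range(N)]
        if abs(t_statistic(xs, true_mean)) > T_CRIT_9:
            hits += 1
    return hits / reps


for kappa in (1.0, 1.2, 3.0):
    print(kappa, rejection_rate(kappa))
```

A rate below 0.05 for κ = 1 corresponds to the conservative behavior described for symmetric Laplace data.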
Conclusion
The distribution of cDNA microarray expression within an array can plausibly be modeled as data from AL(θ, µ, σ). Using a parametric distribution allows for theoretical analysis that is not possible using resampling techniques alone. The distribution of residual error allows for further interpretation and for analysis of normalization techniques. Furthermore, the interpretation of AL(θ, µ, σ) gives reason to think that the distribution for a gene across arrays could be AL(θ, µ, σ) or normal, allowing its use in techniques of gene differentiation as well.
Acknowledgements
Work supported by NSF grant DMS 02-41246 and a Stanford Graduate Fellowship. We would like to thank Persi Diaconis for numerous references and helpful discussions, and Sandrine Dudoit, Yee Yang, and Wolfgang Huber for making their Bioconductor packages freely available.
References
Correlation: How Useful?
• In the range of κ observed empirically, the minimum possible correlation is close to −1 in both cases.
• Implies that measured gene intensities within an array have different variation across genes: specifically, each is normal, with the variance and mean of the normal governed by an exponential distribution.
• The exponential has a memoryless, self-similar property; conditionally, large values are distributed the same as small values.
Possible Biological Relation:
cDNA Microarray Data
Histograms
Possible Biological Relation:
[Figure legend: µ = 0, µ = 0.5, µ = 1, µ = 1.5, µ = 2]
Max. Likelihood Estimates (Kotz et al., 2001):
Clearly the gene expression values are not i.i.d. However, if the assumption behind normalization is true (no underlying difference between the two channels for most genes), we could hope that the dependency structure of the error (as measured in, say, log-ratios) is reduced. Regardless, the fit to AL(θ, µ, σ) is strong, and thus it is interesting to look at possible interpretations.
$$
Y_i \mid W_i \sim N\left(\theta + \mu W_i,\; \sigma^{2} W_i\right), \quad \text{where } W_i \sim \exp(1) \tag{1}
$$
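The mixture representation translates directly into a two-step sampler: draw W, then draw Y given W. The sketch below, with the hypothetical name `sample_al_mixture` and illustrative parameter values, compares the sample mean and variance with θ + µ and σ² + µ².

```python
import math
import random


def sample_al_mixture(theta, mu, sigma, rng):
    """Draw W ~ exp(1), then Y | W ~ N(theta + mu * W, sigma^2 * W)."""
    w = rng.expovariate(1.0)
    return rng.gauss(theta + mu * w, sigma * math.sqrt(w))


# Illustrative parameter values, not taken from the data.
theta, mu, sigma = 0.2, 0.5, 1.0

rng = random.Random(4)
ys = [sample_al_mixture(theta, mu, sigma, rng) for _ in range(100000)]
mean = sum(ys) / len(ys)
var = sum((y - mean) ** 2 for y in ys) / len(ys)

print("sample mean", mean, "vs theta + mu", theta + mu)
print("sample var ", var, "vs sigma^2 + mu^2", sigma ** 2 + mu ** 2)
```

The moments follow from conditioning on W: E(Y) = θ + µE(W) = θ + µ, and var(Y) = σ²E(W) + µ²var(W) = σ² + µ².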
Asymmetric Laplace (for varying µ)
ˆ κ
t Statistic for AL(θ, µ, σ) data
1. Continuous mixture of normal random variables, with dependent scale and mean parameters varying according to an exponential distribution
µ < 0 ⇔ κ > 1 : left skew
µ > 0 ⇔ κ < 1 : right skew
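The correspondence between µ and κ can be made concrete by solving µ = σ(1/κ − κ)/√2 for the positive root κ. The sketch below uses hypothetical function names; it converts between the two parameters and evaluates the density in the (θ, µ, σ) parameterization, with σ = 1 assumed for the printed values.

```python
import math


def kappa_from_mu(mu, sigma):
    """Positive root of mu = sigma * (1/kappa - kappa) / sqrt(2)."""
    return (math.sqrt(mu ** 2 + 2.0 * sigma ** 2) - mu) / (math.sqrt(2) * sigma)


def mu_from_kappa(kappa, sigma):
    """Inverse of kappa_from_mu."""
    return sigma * (1.0 / kappa - kappa) / math.sqrt(2)


def al_pdf(x, theta, mu, sigma):
    """AL(theta, mu, sigma) density, evaluated through kappa."""
    kappa = kappa_from_mu(mu, sigma)
    if x >= theta:
        rate = math.sqrt(2) * kappa / sigma    # decay rate of the right tail
    else:
        rate = math.sqrt(2) / (sigma * kappa)  # decay rate of the left tail
    norm = math.sqrt(2) / sigma * kappa / (1.0 + kappa ** 2)
    return norm * math.exp(-rate * abs(x - theta))


# kappa for a few values of mu, with sigma = 1 assumed
for mu in (-0.5, 0.0, 0.5, 1.0, 2.0):
    print(mu, kappa_from_mu(mu, 1.0))
```

For µ > 0 the right-tail rate is smaller than the left-tail rate, so the right tail is longer, which matches the right-skew direction stated above.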
            θ̂                 κ̂                σ̂                Median    Mean Ave. Dev.
T-Cell      0.039 (0.0022)    1.174 (0.0068)   0.304 (0.0027)   0.006     0.221
Self-Self   −0.001 (0.0026)   1.002 (0.0078)   0.243 (0.0031)   −0.001    0.172
Interpretation of Yi ∼ AL(θ, µ, σ)
[Figure: ρmin and ρBAL for X, Y ∼ AL(θ, µ, σ). Vertical axis: Minimal Possible Correlation; horizontal axis: Value of κ for X and Y; curves: All Joint Distributions, Bivariate Laplace.]
Elizabeth Purdom, Susan Holmes
Department of Statistics, Stanford University
de Veaux, D. (1976). Tight upper and lower bounds for correlation of bivariate distributions arising in air pollution modeling. Technical Report 5, Stanford University, Stanford, California.
Durbin, B., Hardin, J., Hawkins, D. and Rocke, D. (2002). A variance-stabilizing transformation for gene-expression microarray data. Bioinformatics 18 S105–S110.
Huber, W. and Heydebreck, A. V. (2003). vsn package: Variance stabilization and calibration for microarray data. Bioconductor, http://www.r-project.org/.
Kotz, S., Kozubowski, T. and Podgórski, K. (2001). The Laplace Distribution and Generalizations. Birkhäuser, Boston.
Kuznetsov, V. A. (2001). Distribution associated with stochastic processes of gene expression in a single eukaryotic cell. Journal on Applied Signal Processing 4 285–296.
Lehmann, E. (1986). Testing Statistical Hypotheses. 2nd ed. Springer, New York.
Sansing, R. C. (1976). The t-statistic for a double exponential distribution. SIAM Journal on Applied Mathematics 31 634–645.
Wu, Z., Irizarry, R. A., Gentleman, R., Murillo, F. M. and Spencer, F. (2003). A model based background adjustment for oligonucleotide expression arrays. Tech. rep., Johns Hopkins University, Department of Biostatistics.
Xu, T., Shu, C.-T., Purdom, E., Dang, D., Ilsley, D., Guo, Y., Holmes, S. P. and Lee, P. P. (2003). Subtle but consistent differences in gene expression of circulating CD8+ T cells in melanoma patients and healthy donors. To appear in Cancer Research.
Yang, I. V., Chen, E., Hasseman, J. P., Liang, W., Frank, B. C., Wang, S., Sharov, V., Saeed, A., White, J., Li, J., Lee, N. H., Yeatman, T. J. and Quackenbush, J. (2002). Within the fold: Assessing differential expression measures and reproducibility in microarray assays. Genome Biology 3.