RNA-Seq: Differential expression at the gene level
Bloom and Fawcett, A Textbook of Histology.
Michael Love
Computational Molecular Biology
MPI-MG, Berlin
RNA-Seq at the gene level
● compare within sample, across genes
● compare within gene, across samples
  – gene-specific biases: length, GC-content
Some early papers
● Sultan, Schulz, Richard, et al. “A Global View ...” Science (2008)
● Mortazavi, Williams, et al. “Mapping and quantifying ...” Nature Methods (2008)
● Marioni, “RNA-seq: An assessment...” Genome Research (2008)
Sultan, Schulz, Richard (2008)
● RNA-Seq can detect more lowly expressed genes than microarray
● “taking into account the theoretical number of unique 27-mers contained in each exon and the total number of reads generated in each experiment”
Mortazavi, Williams (2008)
● “The sensitivity of RNA-Seq will be a function of both molar concentration and transcript length. We therefore quantified transcript levels in reads per kilobase of exon model per million mapped reads (RPKM)”
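The RPKM definition quoted above can be sketched in a few lines. This is a minimal illustration rather than the authors' pipeline: the function name, argument names, and example numbers are all hypothetical.

```python
def rpkm(read_count, exon_length_bp, total_mapped_reads):
    """Reads per kilobase of exon model per million mapped reads."""
    kilobases = exon_length_bp / 1_000            # exon model length in kb
    millions = total_mapped_reads / 1_000_000     # library size in millions
    return read_count / (kilobases * millions)

# Hypothetical numbers: 500 reads on a 2 kb exon model, 10 million mapped reads.
example = rpkm(500, 2_000, 10_000_000)
print(example)  # 25.0
```

Dividing by length makes genes comparable within a sample; dividing by library size makes samples comparable for a given gene.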
Mortazavi, Williams (2008)
Mortazavi, Williams (2008)
● “The strength of evidence for detecting any given rare transcript with RNA-Seq, especially if it has garnered multiple unique sequence reads, may be considerably greater than that provided by microarrays, because array fluorescence signals from a low-abundance true positive can be very difficult to distinguish, numerically and statistically, from background array signals due to cross-hybridization and dye binding.”
Mortazavi, Williams (2008)
● “This means that it is theoretically possible to sequence the contents of a single cell with minimal prior amplification, though the technical challenges to implementation are considerable.”
Marioni (2008)
● “only a small proportion of genes (~0.5%) show strong evidence for a lane effect (i.e., extra-Poisson variation)”
● more on this later
Outline
● Gene counting and count distributions
● Model for differential expression analysis
Gene-level counts
● htseq-count (EMBL)
● Cufflinks (UMD)
● summarizeOverlaps/GenomicRanges (Bioc.)
● featureCounts (WEHI)
Gene-level counts
{A},{A},...,{},{},...
{A},{A},...,{A,B},{A,B},...
Gene-level counts
● What about isoforms?
● Trapnell, Nature Biotech 2013
● The average human gene has 8.8 exons (IHGSC)
Gene count table
        sample1  sample2  sample3
gene1         0        0        0
gene2         0       12        1
gene3      1000     2000      100
gene4        10       20        2
ReCount: A multi-experiment resource of analysis-ready RNA-seq gene count datasets (google “recount rna”)
Statistical modeling
● Why use statistics?
  – Estimate differences (fold changes)
  – How many will validate? (FDR)
reminder: FDR
Benjamini, Yoav and Hochberg, Yosef, "Controlling the false discovery rate: a practical and powerful approach to multiple testing" (1995)

              truly not DE   truly DE
not signif.   TN             FN
signif.       FP             TP

False discovery rate = E(FP / (FP + TP))
False positive rate = type I error = (1 − specificity) = FP / (FP + TN)
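The Benjamini-Hochberg procedure cited on this slide can be sketched as an adjustment of sorted p-values. The function name and the example p-values are hypothetical; the 10% threshold is only an illustrative choice.

```python
def benjamini_hochberg(pvalues):
    """Return BH-adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone adjusted values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

pvals = [0.001, 0.008, 0.039, 0.041, 0.60]   # hypothetical per-gene p-values
padj = benjamini_hochberg(pvals)
significant = [p <= 0.10 for p in padj]      # call genes at a 10% FDR target
print(padj)
print(significant)
```

Calling a gene significant when its adjusted p-value is at most 0.10 targets an expected proportion of false positives among the calls of at most 10%.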
Non-parametric tests
● Wilcoxon/Mann-Whitney rank sum test
● Li and Tibshirani have generalized it for RNA-Seq: SAMseq() in samr package on CRAN
● Down-sampling to account for sequencing depth
● 5 samples per class: “it is unable to differentiate among the most significant features (top ~2,000 ... and top ~600 ...), as there are too few possible values of the Wilcoxon statistic”
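The "too few possible values" remark can be checked by brute force for 5 samples per class. This sketch assumes no tied ranks and simply enumerates every way to assign ranks 1..10 to one class; the variable names are illustrative.

```python
from itertools import combinations

n_per_class = 5
ranks = range(1, 2 * n_per_class + 1)   # ranks 1..10, assuming no ties

# Rank sum of one class for every possible split into two classes of 5.
rank_sums = [sum(c) for c in combinations(ranks, n_per_class)]
distinct = sorted(set(rank_sums))

print(len(rank_sums))   # 252 possible labelings
print(len(distinct))    # 26 distinct values of the statistic
smallest_two_sided_p = 2 / len(rank_sums)
print(smallest_two_sided_p)
```

With only 26 distinct values, thousands of genes necessarily share the same statistic, and no gene can reach a two-sided p-value below 2/252.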
RNA-Seq as draws from infinite urn
● Imagine taking N colored balls from an urn which contains >> N balls
● The colors are genes, and the balls are fragments in the library
● A column of the count matrix is then multinomial(N,p)
Binomial
● Focusing on a single gene (color), the counts are binomial(N,p)
● Suppose N*p stays constant as N → ∞ and p → 0
● Then the counts are approximately Poisson(N*p)
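The limit can be illustrated numerically by holding N*p fixed at 10 and letting N grow. This is a small sketch with hypothetical helper names; it compares the two probability mass functions over counts 0..40.

```python
import math

def binomial_pmf(k, n, p):
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

def poisson_pmf(k, lam):
    return math.exp(-lam) * lam**k / math.factorial(k)

lam = 10.0  # N * p is held fixed while N grows

def max_abs_diff(n):
    """Largest pointwise gap between binomial(n, lam/n) and Poisson(lam)."""
    p = lam / n
    return max(abs(binomial_pmf(k, n, p) - poisson_pmf(k, lam)) for k in range(41))

diffs = {n: max_abs_diff(n) for n in (100, 1_000, 10_000)}
for n, d in diffs.items():
    print(n, d)
```

The gap shrinks as N increases, which is why a Poisson model is a reasonable description of technical sampling when libraries contain many fragments and each gene accounts for a small fraction of them.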
Binomial N=100, p=1/10
Poisson properties
● Binomial(N, p) + Binomial(M, p) = Binomial(N + M, p)
● Poisson(N*p) + Poisson(M*p) = Poisson(N*p + M*p)
● One parameter, lambda = E(X)
● X ~ Poisson, Var(X) = E(X)
Negative binomial
● Related distribution with two parameters
● The number of failures which occur in a sequence of Bernoulli trials before a target number of successes is reached
● Not helpful for us, instead use mean μ and dispersion α
● Var(X) = μ + α μ²
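The mean/dispersion parameterization can be checked against the classical one numerically. The sketch below assumes the usual conversion, size r = 1/α and success probability p = r / (r + μ); the values of μ and α are hypothetical.

```python
import math

def nb_pmf(k, mu, alpha):
    """Negative binomial pmf parameterized by mean mu and dispersion alpha."""
    r = 1.0 / alpha          # "size": target number of successes
    p = r / (r + mu)         # success probability in the classical form
    log_pmf = (math.lgamma(k + r) - math.lgamma(r) - math.lgamma(k + 1)
               + r * math.log(p) + k * math.log1p(-p))
    return math.exp(log_pmf)

mu, alpha = 100.0, 0.1       # hypothetical mean and dispersion
ks = range(0, 2000)          # truncation point far into the upper tail
mean = sum(k * nb_pmf(k, mu, alpha) for k in ks)
var = sum((k - mean) ** 2 * nb_pmf(k, mu, alpha) for k in ks)

print(mean)                  # close to mu
print(var)                   # close to mu + alpha * mu**2
print(mu + alpha * mu**2)    # 1100.0
```

A Poisson variable with the same mean would have variance 100; the extra α μ² term is the overdispersion.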
Negative binomial
Negative binomial
Poisson mixture
gamma
log normal
More than one Poisson mixture
Why not transform data?
● “hetero-skedasticity”
● literally “different” + “scattering”
● Different variance for different sub-groups of data
● Typically low vs high mean
Variance of transformed counts
Relative power of tests
Bayesian approaches
● False discovery rate = E(FP / (FP + TP))
● Probability of DE
● baySeq (2010), BitSeq (2012), EBSeq (2013)
● Good for both:
  – estimation of effect
  – how many will validate
Generalized linear model
● Linear model
● Nelder and Wedderburn "Generalized Linear Models" J of the Royal Stat Soc (1972)
● McCullagh and Nelder (1989)
● Wikipedia
Generalized linear model
● Target y follows a distribution (exponential family), e.g.:
● y in (-∞,∞) : Normal
● y in {0,1} : Bernoulli/binomial/logistic regression
● y in {0,1,2,...} : Poisson / negative binomial
● y in (0,∞) : gamma
Generalized linear model
● “Link” function connects target y to the predictor variables X
● Poisson : log
● Bernoulli : logit
Generalized linear model
● The target can then be described as the observation of a probabilistic model: P(y|θ)
● If we are trying to find θ, we call this a likelihood: L(θ|y) = P(y|θ)
GLM for RNA-Seq
GLM for RNA-Seq
● Log link function means effects are multiplicative
● Additive effects could lead to negative expected counts
Design matrix
X * β = log2(mu), with β = (b0, b1)

X        log2(mu)
1  0     b0
1  0     b0
1  0     b0
1  0     b0
1  1     b0 + b1
1  1     b0 + b1
1  1     b0 + b1
1  1     b0 + b1

b0 = 1, b1 = 1
“factorial” / “crossed” design
X * β = log2(mu), with β = (b0, b1, b2)

X           log2(mu)
1  0  0     b0
1  0  1     b0 + b2
1  0  0     b0
1  0  1     b0 + b2
1  1  0     b0 + b1
1  1  1     b0 + b1 + b2
1  1  0     b0 + b1
1  1  1     b0 + b1 + b2

b0 = 1, b1 = 1, b2 = 2
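The crossed design can be evaluated directly as a matrix-vector product. This sketch uses the columns and coefficient values shown on the slide (intercept, then the two 0/1 factors); the variable names are illustrative.

```python
# Rows are samples; columns are the intercept and the two 0/1 factors.
X = [
    [1, 0, 0], [1, 0, 1], [1, 0, 0], [1, 0, 1],
    [1, 1, 0], [1, 1, 1], [1, 1, 0], [1, 1, 1],
]
beta = [1, 1, 2]  # b0, b1, b2

# log2(mu) = X * beta, one value per sample.
log2_mu = [sum(x * b for x, b in zip(row, beta)) for row in X]
# Undo the log2 link to get expected counts.
mu = [2 ** v for v in log2_mu]

print(log2_mu)  # [1, 3, 1, 3, 2, 4, 2, 4]
print(mu)       # [2, 8, 2, 8, 4, 16, 4, 16]
```

Because the coefficients add on the log2 scale, each one multiplies the expected count: b1 = 1 doubles it and b2 = 2 quadruples it.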
factorial and numeric predictors
X * β = log2(mu), with β = (b0, b1, b2)

X           log2(mu)
1  0  0     b0
1  1  0     b0 + 1*b1
1  2  0     b0 + 2*b1
1  3  0     b0 + 3*b1
1  0  1     b0 + b2
1  1  1     b0 + 1*b1 + b2
1  2  1     b0 + 2*b1 + b2
1  3  1     b0 + 3*b1 + b2

b0 = 1, b1 = 1, b2 = 2
Test for interactions
● Time course example:
● K ~ condition + time + condition:time
[Figure: log counts vs. time]
How to fit a GLM
● size factor / normalization factor
● dispersion parameter α
● log2 fold changes β
Estimate size factors
● Sum/mean count is problematic
● Robinson and Oshlack: “Trimmed mean of M values” (TMM)
● Anders and Huber: Median ratio of sample to pseudoreference
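Size factors: median ratio method
(rows: genes; s1 to s4: samples; ref: pseudoreference, the per-gene geometric mean)

s1  s2  s3  s4  ref  s1/ref
2   4   4   8   4    0.50
1   1   9   9   3    0.33
1   9   3   3   3    0.33
1   1   4   4   2    0.50
4   2   4   8   4    1.00

s1 = median(s1/ref) = 0.50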
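The median ratio calculation can be reproduced with the numbers on this slide. The sketch reads the first four rows of numbers as samples and the fifth as the pseudoreference; the labels s2 to s4 are assumed, since the slide only names s1. It also assumes no zero counts, because a zero would make the geometric mean zero and the ratio undefined.

```python
import math
from statistics import median

# One list per sample, one entry per gene.
counts = {
    "s1": [2, 1, 1, 1, 4],
    "s2": [4, 1, 9, 1, 2],
    "s3": [4, 9, 3, 4, 4],
    "s4": [8, 9, 3, 4, 8],
}
samples = list(counts.values())
n_genes = len(samples[0])

# Pseudoreference: per-gene geometric mean across samples.
pseudo_ref = [
    math.exp(sum(math.log(s[g]) for s in samples) / len(samples))
    for g in range(n_genes)
]

# Size factor: median over genes of (sample count / pseudoreference).
size_factors = {
    name: median(c / ref for c, ref in zip(col, pseudo_ref))
    for name, col in counts.items()
}
print(pseudo_ref)     # approximately [4, 3, 3, 2, 4]
print(size_factors)   # s1 is approximately 0.5
```

Using the median rather than the sum keeps a handful of very highly expressed genes from dominating the normalization.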
Dispersion parameter α
● Can be fit independently from β
● Method of moments: α = (var - mean) / mean²
● Coefficient of variation = SD/mean
● One dimensional optimization of likelihood
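The method-of-moments estimate is a one-liner once the mean and variance are available. In this sketch the counts are hypothetical, and clamping negative estimates at zero is an added assumption, not something stated on the slide.

```python
from statistics import mean, variance

def moment_dispersion(counts):
    """Method-of-moments estimate: alpha = (var - mean) / mean^2."""
    m = mean(counts)
    v = variance(counts)              # sample variance (n - 1 denominator)
    return max((v - m) / m**2, 0.0)   # clamp at 0 (assumption, see lead-in)

counts = [80, 95, 100, 110, 115]      # hypothetical normalized counts, one gene
alpha_hat = moment_dispersion(counts)
cv_squared = variance(counts) / mean(counts) ** 2

print(alpha_hat)    # 0.00875
print(cv_squared)   # 0.01875 = 1/mean + alpha_hat
```

Dividing Var(X) = μ + α μ² by μ² shows the link to the coefficient of variation: CV² = 1/μ + α, so α is the squared CV that remains once the Poisson part 1/μ is removed.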
log2 fold changes β
● Fit using an iterative method
● Find roots of the derivative of the log likelihood
● Multidimensional Newton's method: x_{n+1} = x_n - f(x_n) / f'(x_n)
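The Newton update can be shown in one dimension. This sketch deliberately simplifies the slide's setting: it fits only an intercept in a Poisson (not negative binomial) GLM, works on the natural-log scale rather than log2, and uses hypothetical counts. Here f is the derivative of the log likelihood with respect to the coefficient b.

```python
import math

def fit_log_mean(counts, tol=1e-10, max_iter=50):
    """Newton's method for an intercept-only Poisson GLM with log link."""
    n, total = len(counts), sum(counts)
    b = 0.0                                   # starting value
    for _ in range(max_iter):
        score = total - n * math.exp(b)       # f(b): derivative of log likelihood
        slope = -n * math.exp(b)              # f'(b)
        step = score / slope
        b -= step                             # x_{n+1} = x_n - f(x_n) / f'(x_n)
        if abs(step) < tol:
            break
    return b

counts = [8, 12, 10, 10]                      # hypothetical counts
b_hat = fit_log_mean(counts)
print(b_hat, math.log(sum(counts) / len(counts)))
```

For this simple model the root is known in closed form (the log of the sample mean), which makes it a convenient check; with several coefficients the same update uses the gradient vector and the matrix of second derivatives.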
GLM hypothesis testing
● Want to test H0: β_condition = 0
● H0 is a nested hypothesis in H_alt
● Likelihood ratio test: D = -2 * log(L0 / L_alt)
● Under H0, D ~ χ² with df = df_alt - df0
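A two-group comparison gives a compact instance of the likelihood ratio test. The sketch below is a simplification under stated assumptions: it uses a Poisson likelihood instead of the negative binomial, ignores size factors, and the counts are hypothetical. With one mean under H0 and two under H_alt, df = 1.

```python
import math

def poisson_loglik(counts, mu):
    return sum(y * math.log(mu) - mu - math.lgamma(y + 1) for y in counts)

group_a = [10, 12, 9, 11]    # hypothetical counts, condition A
group_b = [20, 25, 18, 22]   # hypothetical counts, condition B

# H0: one shared mean (beta_condition = 0); H_alt: one mean per condition.
pooled = group_a + group_b
loglik0 = poisson_loglik(pooled, sum(pooled) / len(pooled))
loglik_alt = (poisson_loglik(group_a, sum(group_a) / len(group_a))
              + poisson_loglik(group_b, sum(group_b) / len(group_b)))

D = -2 * (loglik0 - loglik_alt)          # same as -2 * log(L0 / L_alt)
# Chi-squared upper tail for df = 1: P(Z^2 > D) = erfc(sqrt(D / 2)).
p_value = math.erfc(math.sqrt(D / 2))
print(D, p_value)
```

The means plugged in are the maximum likelihood estimates under each hypothesis, so D measures how much the fit improves when the condition effect is allowed to be nonzero.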
“MA”-plot
Dispersion plot
Information sharing
Efron and Morris “Stein's Paradox in Statistics” 1977
● Add bias, reduce variance
● “Shrinkage estimators”, “maximum a posteriori”
Information sharing “data from one gene are informative about the mean and variance of another gene”
“Analyzing ’omics data using hierarchical models” Hongkai Ji and X Shirley Liu, Nat Biotech 2010
Information sharing
Gene-sample-specific normalization factor / offset
Removing technical variability in RNA-seq data ... Kasper D. Hansen and Rafael A. Irizarry, Biostat 2011
Sensitivity to outliers
“Finding consistent patterns: a nonparametric approach for identifying differential expression in RNA-Seq data” Jun Li and Robert Tibshirani
Gene-level DE software
● Non-parametric: SAMseq
● Bayesian: baySeq*
● Negative binomial: Cuffdiff, DESeq*, DSS*, edgeR*
● Transformed data: voom (limma)*
*