Lecture

RNA-Seq: Differential expression at the gene level

Bloom and Fawcett, A Textbook of Histology.

Michael Love

Computational Molecular Biology

MPI-MG, Berlin

RNA-Seq at the gene level

● compare within sample, across genes
● compare within gene, across samples
  – gene-specific biases: length, GC-content

Some early papers

● Sultan, Schulz, Richard, et al. “A Global View ...” Science (2008)
● Mortazavi, Williams, et al. “Mapping and quantifying ...” Nature Methods (2008)
● Marioni, “RNA-seq: An assessment...” Genome Research (2008)

Sultan, Schulz, Richard (2008)

● RNA-Seq can detect more lowly expressed genes than microarray
● “taking into account the theoretical number of unique 27-mers contained in each exon and the total number of reads generated in each experiment”

Mortazavi, Williams (2008)

● “The sensitivity of RNA-Seq will be a function of both molar concentration and transcript length. We therefore quantified transcript levels in reads per kilobase of exon model per million mapped reads (RPKM)”
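The RPKM definition quoted above can be written out as a small function. This is a minimal sketch; the function name, argument names and the example gene are illustrative assumptions, not taken from the paper.

```python
def rpkm(read_count, exon_length_bp, total_mapped_reads):
    """Reads per kilobase of exon model per million mapped reads."""
    kilobases = exon_length_bp / 1_000
    millions = total_mapped_reads / 1_000_000
    return read_count / (kilobases * millions)

# Hypothetical gene: 500 reads, 2,000 bp of exon model, 10 million mapped reads.
example_rpkm = rpkm(500, 2_000, 10_000_000)   # 500 / (2 kb * 10 M) = 25.0
```

Dividing by length makes genes comparable within a sample; dividing by the number of mapped reads is one simple way to compare across samples.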

Mortazavi, Williams (2008)


“The strength of evidence for detecting any given rare transcript with RNA-Seq, especially if it has garnered multiple unique sequence reads, may be considerably greater than that provided by microarrays, because array fluorescence signals from a low-abundance true positive can be very difficult to distinguish, numerically and statistically, from background array signals due to cross-hybridization and dye binding.”


“This means that it is theoretically possible to sequence the contents of a single cell with minimal prior amplification, though the technical challenges to implementation are considerable.”

Marioni (2008)

● “only a small proportion of genes (~0.5%) show strong evidence for a lane effect (i.e., extra-Poisson variation)”
● more on this later

Outline

● Gene counting and count distributions
● Model for differential expression analysis

Gene-level counts

● htseq-count (EMBL)
● Cufflinks (UMD)
● summarizeOverlaps/GenomicRanges (Bioc.)
● featureCounts (WEHI)

Gene-level counts

{A},{A},...,{},{},...

{A},{A},...,{A,B},{A,B},...

Gene-level counts

● What about isoforms?
● Trapnell, Nature Biotech 2013
● The average human gene has 8.8 exons (IHGSC)

Gene count table

        sample1  sample2  sample3
gene1         0        0        0
gene2         0       12        1
gene3      1000     2000      100
gene4        10       20        2

ReCount: A multi-experiment resource of analysis-ready RNA-seq gene count datasets (google “recount rna”)

Statistical modeling

● Why use statistics?
  – Estimate differences (fold changes)
  – How many will validate? (FDR)

reminder: FDR

Benjamini, Yoav and Hochberg, Yosef, "Controlling the false discovery rate: a practical and powerful approach to multiple testing" (1995)

              truly not DE   truly DE
not signif.   TN             FN
signif.       FP             TP

False discovery rate = E(FP / (FP + TP))
False positive rate = type I error = (1 − specificity) = FP / (FP + TN)
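As a reminder of how FDR control is typically applied to a list of per-gene p-values, here is a minimal sketch of the Benjamini-Hochberg adjustment. The p-values and the 10% threshold are hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

pvals = [0.001, 0.008, 0.039, 0.041, 0.6]      # hypothetical per-gene p-values
padj = benjamini_hochberg(pvals)
significant = [p <= 0.1 for p in padj]         # genes called at a 10% FDR target
```

Calling genes with adjusted p-value below 0.1 aims for about 10% false positives among the called genes, which is the "how many will validate" question.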

Non-parametric tests

● Wilcoxon/Mann-Whitney rank sum test
● Li and Tibshirani have generalized for RNA-Seq: SAMseq() in samr package on CRAN
● Down-sampling to account for sequencing depth
● 5 samples per class: “it is unable to differentiate among the most significant features (top ~2,000 ... and top ~600 ...), as there are too few possible values of the Wilcoxon statistic”
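The "too few possible values" remark can be checked by enumeration. The sketch below lists every rank-sum value for 5 versus 5 samples, assuming no ties; it illustrates the counting argument only, not SAMseq itself.

```python
from itertools import combinations

n_per_class = 5
ranks = range(1, 2 * n_per_class + 1)

# Every way of assigning 5 of the 10 ranks to the first class (no ties assumed).
rank_sums = [sum(chosen) for chosen in combinations(ranks, n_per_class)]

n_arrangements = len(rank_sums)             # C(10, 5) equally likely arrangements under H0
distinct_values = sorted(set(rank_sums))    # possible values of the rank-sum statistic

# Only the two most extreme arrangements give the smallest two-sided p-value.
min_two_sided_p = 2 / n_arrangements
```

With so few attainable values, thousands of genes end up tied at the same smallest p-value and cannot be ranked against each other.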

RNA-Seq as draws from infinite urn

● Imagine taking N colored balls from an urn which contains >> N balls
● The colors are genes, and the balls are fragments in the library
● A column of the count matrix is then multinomial(N,p)

Binomial

● Focusing on a single gene (color), the counts are binomial(N,p)
● Suppose N*p stays constant as N => ∞ and p => 0
● Then the counts are approximately Poisson(N*p)
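The limit can be checked numerically. This sketch compares the binomial and Poisson probability mass functions while N*p is held at 10; the helper names and the second value of N are illustrative choices.

```python
import math

def binomial_pmf(k, n, p):
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

def poisson_pmf(k, lam):
    return math.exp(-lam) * lam**k / math.factorial(k)

def max_abs_diff(n, lam, k_max=40):
    """Largest pointwise gap between Binomial(n, lam/n) and Poisson(lam) for k <= k_max."""
    p = lam / n
    return max(abs(binomial_pmf(k, n, p) - poisson_pmf(k, lam))
               for k in range(k_max + 1))

lam = 10
diff_small_n = max_abs_diff(100, lam)        # N = 100, p = 1/10
diff_large_n = max_abs_diff(100_000, lam)    # N = 100,000, p = 1/10,000
```

The gap shrinks as N grows, which is why a Poisson model is a reasonable description of sampling noise for a single gene in a large library.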

Binomial N=100, p=1/10

Poisson properties

● Binomial(N, p) + Binomial(M, p) = Binomial(N + M, p)
● Poisson(N*p) + Poisson(M*p) = Poisson(N*p + M*p)
● One parameter, lambda = E(X)
● X ~ Poisson, Var(X) = E(X)

Negative binomial

● Related distribution with two parameters
● The number of failures which occur in a sequence of Bernoulli trials before a target number of successes is reached
● Not helpful for us, instead use mean μ and dispersion α
● Var(X) = μ + α μ^2
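One way to see where Var(X) = μ + α μ^2 comes from is to simulate a Poisson whose rate is itself gamma distributed. The sketch below assumes a gamma with mean μ and variance α μ^2; the values of μ and α, the seed and the sample size are arbitrary illustrative choices.

```python
import math
import random
import statistics

def poisson_draw(lam, rng):
    """Knuth's multiplication method; adequate for modest values of lam."""
    threshold = math.exp(-lam)
    k, product = 0, rng.random()
    while product > threshold:
        k += 1
        product *= rng.random()
    return k

def gamma_poisson_draw(mu, alpha, rng):
    # Gamma rate with mean mu and variance alpha * mu^2: shape 1/alpha, scale alpha * mu.
    lam = rng.gammavariate(1 / alpha, alpha * mu)
    return poisson_draw(lam, rng)

rng = random.Random(1)
mu, alpha = 20.0, 0.5
draws = [gamma_poisson_draw(mu, alpha, rng) for _ in range(20_000)]

sample_mean = statistics.fmean(draws)
sample_var = statistics.variance(draws)
expected_var = mu + alpha * mu**2   # Poisson part (mu) plus extra-Poisson part (alpha * mu^2)
```

The sample variance should land near expected_var, far above the value μ that a pure Poisson would give.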

Negative binomial

Negative binomial

Poisson mixture

gamma

log normal

More than one Poisson mixture

Why not transform data?

● “hetero-skedasticity”
● different + scattering
● Different variance for different sub-groups of data
● Typically low vs high mean
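To make the low-mean versus high-mean contrast concrete, the sketch below computes the exact variance of log2(count + 1) for Poisson counts at several means. The pseudocount of 1 and the chosen means are assumptions for illustration.

```python
import math

def poisson_pmf(k, lam):
    # Work on the log scale so that large lam does not overflow.
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))

def var_log2_plus1(lam):
    """Variance of log2(X + 1) for X ~ Poisson(lam), by summing over the pmf."""
    k_max = int(lam + 12 * math.sqrt(lam) + 30)   # far enough into the upper tail
    probs = [poisson_pmf(k, lam) for k in range(k_max + 1)]
    values = [math.log2(k + 1) for k in range(k_max + 1)]
    mean = sum(p * v for p, v in zip(probs, values))
    return sum(p * (v - mean) ** 2 for p, v in zip(probs, values))

variances = {lam: var_log2_plus1(lam) for lam in (5, 50, 500)}
```

Even with pure Poisson noise, the transformed counts are much noisier at a mean of 5 than at a mean of 500, so a test that assumes one common variance is miscalibrated for part of the genes.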

Variance of transformed counts

Relative power of tests

Bayesian approaches

● False discovery rate = E(FP / (FP + TP))
● Probability of DE
● baySeq (2010), BitSeq (2012), EBSeq (2013)
● Good for both:
  – estimation of effect
  – how many will validate

Generalized linear model

● Linear model
● Nelder and Wedderburn "Generalized Linear Models" J of the Royal Stat Soc (1972)
● McCullagh and Nelder (1989)
● Wikipedia


Target y is a distribution (exponential family), e.g.:

● y in (-∞,∞) : Normal
● y in {0,1} : Bernoulli/binomial/logistic regression
● y in {0,1,2,...} : Poisson / negative binomial
● y in (0,∞) : gamma


“Link” function connects target y to the predictor variables X

● Poisson : log
● Bernoulli : logit


● The target can then be described as the observation of a probabilistic model: P(y|θ)
● If we are trying to find θ, we call this a likelihood: L(θ|y) = P(y|θ)
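As a small illustration of reading P(y|θ) as a function of θ, the sketch below evaluates a Poisson log likelihood over a grid of candidate rates. The counts and the grid are hypothetical.

```python
import math

def poisson_log_likelihood(lam, counts):
    """log L(lam | counts): the sum of log P(y | lam) over the observations."""
    return sum(y * math.log(lam) - lam - math.lgamma(y + 1) for y in counts)

counts = [3, 7, 5, 9, 6]                      # hypothetical counts with mean 6
grid = [x / 10 for x in range(10, 151)]       # candidate values of lam: 1.0, 1.1, ..., 15.0
best_lam = max(grid, key=lambda lam: poisson_log_likelihood(lam, counts))
```

For the Poisson, the maximizing value coincides with the sample mean; for models without such a closed form, the maximum has to be found numerically.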

GLM for RNA-Seq

GLM for RNA-Seq

● Log link function means effects are multiplicative
● Additive effects could lead to negative expected counts

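Design matrix

\[
X \beta =
\begin{pmatrix} 1 & 0 \\ 1 & 0 \\ 1 & 0 \\ 1 & 0 \\ 1 & 1 \\ 1 & 1 \\ 1 & 1 \\ 1 & 1 \end{pmatrix}
\begin{pmatrix} b_0 \\ b_1 \end{pmatrix}
=
\begin{pmatrix} b_0 \\ b_0 \\ b_0 \\ b_0 \\ b_0 + b_1 \\ b_0 + b_1 \\ b_0 + b_1 \\ b_0 + b_1 \end{pmatrix}
= \log_2(\mu)
\]

b0 = 1, b1 = 1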
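The two-group design can be evaluated directly. The sketch below multiplies the design matrix by the coefficients b0 = 1, b1 = 1 and converts back from the log2 scale; the helper name is an illustrative choice.

```python
def linear_predictor(design, beta):
    """Matrix-vector product X * beta: one value per sample."""
    return [sum(x * b for x, b in zip(row, beta)) for row in design]

# Intercept column plus a 0/1 condition indicator for 8 samples.
X = [[1, 0]] * 4 + [[1, 1]] * 4
beta = [1, 1]                                  # b0 = 1, b1 = 1

log2_mu = linear_predictor(X, beta)            # [1, 1, 1, 1, 2, 2, 2, 2]
mu = [2 ** v for v in log2_mu]                 # [2, 2, 2, 2, 4, 4, 4, 4]
fold_change = mu[-1] / mu[0]                   # 2 ** b1
```

Adding b1 on the log2 scale multiplies the expected count by 2^b1, which is the sense in which effects under a log link are multiplicative.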

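“factorial” / “crossed” design

\[
X \beta =
\begin{pmatrix} 1 & 0 & 0 \\ 1 & 0 & 1 \\ 1 & 0 & 0 \\ 1 & 0 & 1 \\ 1 & 1 & 0 \\ 1 & 1 & 1 \\ 1 & 1 & 0 \\ 1 & 1 & 1 \end{pmatrix}
\begin{pmatrix} b_0 \\ b_1 \\ b_2 \end{pmatrix}
=
\begin{pmatrix} b_0 \\ b_0 + b_2 \\ b_0 \\ b_0 + b_2 \\ b_0 + b_1 \\ b_0 + b_1 + b_2 \\ b_0 + b_1 \\ b_0 + b_1 + b_2 \end{pmatrix}
= \log_2(\mu)
\]

b0 = 1, b1 = 1, b2 = 2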

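factorial and numeric predictors

\[
X \beta =
\begin{pmatrix} 1 & 0 & 0 \\ 1 & 1 & 0 \\ 1 & 2 & 0 \\ 1 & 3 & 0 \\ 1 & 0 & 1 \\ 1 & 1 & 1 \\ 1 & 2 & 1 \\ 1 & 3 & 1 \end{pmatrix}
\begin{pmatrix} b_0 \\ b_1 \\ b_2 \end{pmatrix}
=
\begin{pmatrix} b_0 \\ b_0 + 1 b_1 \\ b_0 + 2 b_1 \\ b_0 + 3 b_1 \\ b_0 + b_2 \\ b_0 + 1 b_1 + b_2 \\ b_0 + 2 b_1 + b_2 \\ b_0 + 3 b_1 + b_2 \end{pmatrix}
= \log_2(\mu)
\]

b0 = 1, b1 = 1, b2 = 2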

Test for interactions

● Time course example:
● K ~ condition + time + condition:time

[Figure: log counts vs. time]
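A sketch of what the condition:time term adds, assuming a 0/1 condition, numeric time points 0 to 3, and made-up coefficients. The interaction column is simply the product of the condition and time columns.

```python
def model_row(condition, time):
    # Columns: intercept, condition, time, condition:time (product of the two).
    return [1, condition, time, condition * time]

samples = [(c, t) for c in (0, 1) for t in (0, 1, 2, 3)]
X = [model_row(c, t) for c, t in samples]

# Hypothetical coefficients: b0, b_condition, b_time, b_interaction.
beta = [1.0, 0.0, 1.0, 0.5]
log2_mu = [sum(x * b for x, b in zip(row, beta)) for row in X]

slope_control = log2_mu[1] - log2_mu[0]   # condition 0: b_time
slope_treated = log2_mu[5] - log2_mu[4]   # condition 1: b_time + b_interaction
```

Testing whether the interaction coefficient is zero asks whether the two conditions have different slopes over time on the log scale.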

How to fit a GLM

● size factor / normalization factor
● dispersion parameter α
● log2 fold changes β

Estimate size factors

● Sum/mean count is problematic
● Robinson and Oshlack: “Trimmed mean of M values” (TMM)
● Anders and Huber: Median ratio of sample to pseudoreference

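Size factors: median ratio method

        sample1  sample2  sample3  sample4  geometric mean  sample1 / geometric mean
gene1         2        4        4        8               4                      0.50
gene2         1        1        9        9               3                      0.33
gene3         1        9        3        3               3                      0.33
gene4         1        1        4        4               2                      0.50
gene5         4        2        4        8               4                      1.00

s1 = median(0.50, 0.33, 0.33, 0.50, 1.00) = 0.50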
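The worked example can be reproduced in a few lines. This is a minimal sketch of the median ratio idea on the slide's five genes and four samples; the sample names are illustrative, and genes with a zero count would need to be excluded before taking logs.

```python
import math
import statistics

# Counts for five genes in each of four samples.
counts = {
    "sample1": [2, 1, 1, 1, 4],
    "sample2": [4, 1, 9, 1, 2],
    "sample3": [4, 9, 3, 4, 4],
    "sample4": [8, 9, 3, 4, 8],
}

def geometric_mean(values):
    return math.exp(sum(math.log(v) for v in values) / len(values))

n_genes = 5
# Pseudo-reference sample: per-gene geometric mean across all samples.
pseudo_reference = [geometric_mean([col[g] for col in counts.values()])
                    for g in range(n_genes)]

# Size factor: median over genes of (sample count / pseudo-reference).
size_factors = {
    name: statistics.median(c / ref for c, ref in zip(col, pseudo_reference))
    for name, col in counts.items()
}
```

Because the median ignores the most extreme ratios, a handful of very highly or differentially expressed genes does not drive the normalization, unlike the total count.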

Dispersion parameter α

● Can be fit independently from β
● Method of moments: α = (var - mean) / mean^2
● Coefficient of variation = SD/mean
● One dimensional optimization of likelihood
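A sketch of the method-of-moments estimate for one gene, assuming the counts are already divided by their size factors. Truncating negative estimates at zero is an added assumption, and the counts are hypothetical.

```python
import statistics

def moment_dispersion(normalized_counts):
    """Method of moments: alpha = (var - mean) / mean^2, truncated at zero."""
    mean = statistics.fmean(normalized_counts)
    var = statistics.variance(normalized_counts)
    return max((var - mean) / mean**2, 0.0)

gene_counts = [10, 20, 30]                     # hypothetical normalized counts for one gene
alpha_hat = moment_dispersion(gene_counts)     # (100 - 20) / 400

# Squared coefficient of variation splits into a Poisson part and the dispersion:
# (SD / mean)^2 = 1 / mean + alpha
cv = statistics.stdev(gene_counts) / statistics.fmean(gene_counts)
```

For large means the 1/mean term vanishes, so the square root of α can be read as the coefficient of variation between replicates.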

log2 fold changes β

● Fit using an iterative method
● Find roots of the derivative of the log likelihood
● Multidimensional Newton's method: x_{n+1} = x_n - f(x_n) / f'(x_n)
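To show the iteration in the simplest setting, the sketch below runs a two-parameter Newton's method for a Poisson GLM with a natural-log link and a single 0/1 covariate. This is a simplification chosen for illustration (no dispersion, no size factors), and the counts are hypothetical.

```python
import math

def fit_poisson_two_group(y, x, n_iter=25):
    """Newton's method for log(mu) = b0 + b1 * x with Poisson counts y."""
    b0, b1 = math.log(sum(y) / len(y)), 0.0    # start at the overall mean, no effect
    for _ in range(n_iter):
        mu = [math.exp(b0 + b1 * xi) for xi in x]
        # Gradient of the log likelihood: X^T (y - mu)
        g0 = sum(yi - mi for yi, mi in zip(y, mu))
        g1 = sum(xi * (yi - mi) for xi, yi, mi in zip(x, y, mu))
        # Negative Hessian: X^T diag(mu) X, a symmetric 2x2 matrix
        h00 = sum(mu)
        h01 = sum(xi * mi for xi, mi in zip(x, mu))
        h11 = sum(xi * xi * mi for xi, mi in zip(x, mu))
        det = h00 * h11 - h01 * h01
        # Newton step: beta <- beta + (negative Hessian)^-1 * gradient
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

y = [8, 12, 10, 10, 20, 22, 18, 20]            # group means 10 and 20
x = [0, 0, 0, 0, 1, 1, 1, 1]
b0_hat, b1_hat = fit_poisson_two_group(y, x)
log2_fold_change = b1_hat / math.log(2)
```

In this two-group case the answer is known in closed form (the log of the ratio of group means), which makes it a convenient check on the iteration.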

GLM hypothesis testing

● Want to test H_0 : β_condition = 0
● H_0 is a nested hypothesis in H_alt
● Likelihood ratio test: D = -2 * log(L_0 / L_alt)
● Under H_0, D ~ χ^2 with df = df_alt - df_0
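A sketch of the likelihood ratio test for a two-group comparison, using a Poisson likelihood for simplicity rather than the negative binomial. The counts are hypothetical, and the p-value uses the closed form of the χ^2 tail for one degree of freedom.

```python
import math

def poisson_loglik(counts, mu):
    return sum(y * math.log(mu) - mu - math.lgamma(y + 1) for y in counts)

group0 = [8, 12, 10, 10]      # hypothetical counts, condition A
group1 = [20, 22, 18, 20]     # hypothetical counts, condition B
pooled = group0 + group1

# H0: one common mean (1 parameter). H_alt: one mean per group (2 parameters).
mean_all = sum(pooled) / len(pooled)
mean0 = sum(group0) / len(group0)
mean1 = sum(group1) / len(group1)

ll_null = poisson_loglik(pooled, mean_all)
ll_alt = poisson_loglik(group0, mean0) + poisson_loglik(group1, mean1)

D = max(-2 * (ll_null - ll_alt), 0.0)      # -2 * log(L_0 / L_alt)
p_value = math.erfc(math.sqrt(D / 2))      # chi-squared upper tail with df = 2 - 1 = 1
```

The larger model can never fit worse than the nested one, so D is non-negative; a large D means the extra condition parameter improves the fit more than chance would explain.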

“MA”-plot

Dispersion plot

Information sharing

Efron and Morris “Stein's Paradox in Statistics” 1977

● Add bias, reduce variance
● “Shrinkage estimators”, “maximum a posteriori”
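A generic sketch of a shrinkage (maximum a posteriori) estimate, assuming a normal likelihood for the raw estimate and a normal prior centered at zero. It is a textbook normal-normal example, not the procedure of any particular package, and all numbers are hypothetical.

```python
def shrink(estimate, standard_error, prior_mean=0.0, prior_sd=1.0):
    """Posterior mode for a normal likelihood combined with a normal prior."""
    # Weight on the data: close to 1 when the estimate is precise.
    weight = prior_sd**2 / (prior_sd**2 + standard_error**2)
    return prior_mean + weight * (estimate - prior_mean)

# The same raw log2 fold change of 3.0, measured with different precision.
precise = shrink(3.0, standard_error=0.2)   # barely moved
noisy = shrink(3.0, standard_error=2.0)     # pulled strongly toward the prior mean
```

The noisy estimate is biased toward zero, but across many genes such estimates vary less, which is the bias for variance trade on the slide.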

Information sharing

“data from one gene are informative about the mean and variance of another gene”

“Analyzing ’omics data using hierarchical models” Hongkai Ji and X Shirley Liu, Nat Biotech 2010

Information sharing

Gene-sample-specific normalization factor / offset

Removing technical variability in RNA-seq data ... Kasper D. Hansen and Rafael A. Irizarry, Biostat 2011

Sensitivity to outliers

“Finding consistent patterns: a nonparametric approach for identifying differential expression in RNA-Seq data” Jun Li and Robert Tibshirani

Gene-level DE software

● Non-parametric: SAMseq
● Bayesian: baySeq*
● Negative binomial: Cuffdiff, DESeq*, DSS*, edgeR*
● Transformed data: voom (limma)*

*
