Lecture

RNA-Seq: Differential expression at the gene level

Bloom and Fawcett, A Textbook of Histology.

Michael Love

Computational Molecular Biology

MPI-MG, Berlin

RNA-Seq at the gene level



compare within sample, across genes



compare within gene, across samples

– gene-specific biases: length, GC-content

Some early papers

Sultan, Schulz, Richard, et al. “A Global View ...” Science (2008)

Mortazavi, Williams, et al. “Mapping and quantifying ...” Nature Methods (2008)

Marioni, “RNA-seq: An assessment...” Genome Research (2008)

Sultan, Schulz, Richard (2008)



RNA-Seq can detect more lowly expressed genes than microarray

“taking into account the theoretical number of unique 27-mers contained in each exon and the total number of reads generated in each experiment”

Mortazavi, Williams (2008)

“The sensitivity of RNA-Seq will be a function of both molar concentration and transcript length. We therefore quantified transcript levels in reads per kilobase of exon model per million mapped reads (RPKM)”

Mortazavi, Williams (2008)

Mortazavi, Williams (2008)

“The strength of evidence for detecting any given rare transcript with RNA-Seq, especially if it has garnered multiple unique sequence reads, may be considerably greater than that provided by microarrays, because array fluorescence signals from a low-abundance true positive can be very difficult to distinguish, numerically and statistically, from background array signals due to cross-hybridization and dye binding.”

Mortazavi, Williams (2008)

“This means that it is theoretically possible to sequence the contents of a single cell with minimal prior amplification, though the technical challenges to implementation are considerable.”

Marioni (2008)





“only a small proportion of genes (~0.5%) show strong evidence for a lane effect (i.e., extra-Poisson variation)”

more on this later

Outline



Gene counting and count distributions



Model for differential expression analysis

Gene-level counts

htseq-count (EMBL)



Cufflinks (UMD)



summarizeOverlaps/GenomicRanges (Bioc.)



featureCounts (WEHI)

Gene-level counts

{A},{A},...,{},{},...

{A},{A},...,{A,B},{A,B},...

Gene-level counts

What about isoforms?



Trapnell, Nature Biotech 2013



The average human gene has 8.8 exons (IHGSC)

Gene count table

        sample1  sample2  sample3
gene1         0        0        0
gene2         0       12        1
gene3      1000     2000      100
gene4        10       20        2

ReCount: A multi-experiment resource of analysis-ready RNA-seq gene count datasets (google “recount rna”)

Statistical modeling



Why use statistics?

– Estimate differences (fold changes)

– How many will validate? (FDR)

reminder: FDR

Benjamini, Yoav and Hochberg, Yosef, "Controlling the false discovery rate: a practical and powerful approach to multiple testing" (1995)

              truly not DE   truly DE
not signif.   TN             FN
signif.       FP             TP

False discovery rate = E(FP / (FP + TP))

False positive rate = type I error = (1 − specificity) = FP / (FP + TN)
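As a rough illustration of how FDR control is applied to a list of per-gene p-values, here is a minimal sketch of the Benjamini-Hochberg adjustment cited above. The function name `benjamini_hochberg` and the four p-values are made up for the example; real analyses would normally rely on an existing implementation.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(pvalues)
    # Indices of the p-values from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, so each adjusted value is the
    # minimum of m * p / rank over all ranks at or above the current one.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical p-values for four genes (not from the lecture).
pvals = [0.01, 0.04, 0.03, 0.005]
padj = benjamini_hochberg(pvals)
# Genes called significant when the FDR is controlled at 5%.
significant = [p <= 0.05 for p in padj]
```

Calling genes with an adjusted p-value at or below 0.05 significant is one common way to aim for an expected FP / (FP + TP) of at most 5%.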

Non-parametric tests

Wilcoxon/Mann-Whitney rank sum test

Li and Tibshirani have generalized for RNA-Seq: SAMseq() in samr package on CRAN

Down-sampling to account for sequencing depth

5 samples per class: “it is unable to differentiate among the most significant features (top ~2,000 ... and top ~600 ...), as there are too few possible values of the Wilcoxon statistic”

RNA-Seq as draws from infinite urn

Imagine taking N colored balls from an urn which contains >> N balls

The colors are genes, and the balls are fragments in the library

A column of the count matrix is then multinomial(N,p)
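The urn picture can be simulated directly. The sketch below draws N fragments with replacement (which approximates an urn holding far more than N balls) and tallies them per gene; the gene names, proportions, and N are hypothetical values chosen for illustration.

```python
import random
from collections import Counter

random.seed(1)  # fixed seed so the example is reproducible

genes = ["gene1", "gene2", "gene3", "gene4"]
p = [0.05, 0.15, 0.70, 0.10]  # hypothetical gene proportions in the library
N = 1000                      # hypothetical number of sequenced fragments

# Each draw picks one fragment; its "color" is the gene it came from.
draws = random.choices(genes, weights=p, k=N)
tally = Counter(draws)

# One column of the count matrix: a single multinomial(N, p) sample.
column = [tally[g] for g in genes]
```

The entries of `column` always sum to N, which is why sequencing depth has to be accounted for before comparing samples.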

Binomial

Focusing on a single gene (color), the counts are binomial(N,p)

Suppose N*p stays constant as N → ∞ and p → 0

The counts then converge to Poisson(N*p)
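The limit can be checked numerically. This sketch holds N*p at 10 and compares the binomial and Poisson probabilities as N grows; the helper names and the cutoff `kmax` are choices made for the example.

```python
import math


def binomial_pmf(k, n, p):
    """P(X = k) for X ~ binomial(n, p)."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam)."""
    return math.exp(-lam) * lam**k / math.factorial(k)


def max_abs_difference(n, lam, kmax=40):
    """Largest gap between the two pmfs over k = 0..kmax, with p = lam / n."""
    p = lam / n
    return max(
        abs(binomial_pmf(k, n, p) - poisson_pmf(k, lam)) for k in range(kmax + 1)
    )


lam = 10  # N * p held constant
diffs = {n: max_abs_difference(n, lam) for n in (100, 1000, 10000)}
```

The gap in `diffs` shrinks as N increases, which is the sense in which binomial(N,p) approaches Poisson(N*p).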

Binomial N=100, p=1/10

Poisson properties

Binomial(N, p) + Binomial(M, p) = Binomial(N + M, p)

Poisson(N*p) + Poisson(M*p) = Poisson(N*p + M*p)



One parameter, lambda = E(X)



X ~ Poisson, Var(X) = E(X)
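Both properties can be verified from the probability mass function alone. In this sketch the rates 3 and 5 and the truncation point `kmax` are arbitrary illustrative values; the sum of two independent Poisson variables is obtained by convolving their pmfs.

```python
import math


def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam)."""
    return math.exp(-lam) * lam**k / math.factorial(k)


a, b = 3.0, 5.0  # hypothetical rates
kmax = 60        # truncation point; the tail beyond it is negligible here

# pmf of X + Y for independent X ~ Poisson(a), Y ~ Poisson(b), by convolution.
conv = [
    sum(poisson_pmf(j, a) * poisson_pmf(k - j, b) for j in range(k + 1))
    for k in range(kmax + 1)
]
# pmf of a single Poisson(a + b) variable.
direct = [poisson_pmf(k, a + b) for k in range(kmax + 1)]
max_gap = max(abs(c - d) for c, d in zip(conv, direct))

# Mean and variance computed from the pmf: both equal lambda = a + b.
mean = sum(k * q for k, q in enumerate(direct))
var = sum((k - mean) ** 2 * q for k, q in enumerate(direct))
```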

Negative binomial

Related distribution with two parameters

The number of failures which occur in a sequence of Bernoulli trials before a target number of successes is reached

Not helpful for us, instead use mean μ and dispersion α

Var(X) = μ + α μ^2
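To connect the two parameterizations, the sketch below writes the negative binomial pmf in terms of μ and α, using the standard conversion size r = 1/α and success probability r / (r + μ), and then checks the variance formula numerically. The values μ = 10 and α = 0.5 are arbitrary.

```python
import math


def nbinom_pmf(k, mu, alpha):
    """P(X = k) for a negative binomial with mean mu and dispersion alpha."""
    r = 1.0 / alpha   # "target number of successes" (may be non-integer)
    p = r / (r + mu)  # success probability of each Bernoulli trial
    log_pmf = (
        math.lgamma(k + r) - math.lgamma(k + 1) - math.lgamma(r)
        + r * math.log(p) + k * math.log(1 - p)
    )
    return math.exp(log_pmf)


mu, alpha = 10.0, 0.5  # hypothetical mean and dispersion
pmf = [nbinom_pmf(k, mu, alpha) for k in range(2000)]

mean = sum(k * q for k, q in enumerate(pmf))
var = sum((k - mean) ** 2 * q for k, q in enumerate(pmf))
expected_var = mu + alpha * mu**2  # 10 + 0.5 * 100 = 60
```

With α = 0 the extra term vanishes and the variance reduces to the Poisson case, Var(X) = μ.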

Negative binomial

Negative binomial

Poisson mixture

gamma

log normal

More than one Poisson mixture

Why not transform data?

“hetero-skedasticity”: different + scattering

Different variance for different sub-groups of data

Typically low vs high mean

Variance of transformed counts

Relative power of tests

Bayesian approaches

False discovery rate = E(FP / (FP + TP))



Probability of DE



baySeq (2010), BitSeq (2012), EBSeq (2013)



Good for both:

– estimation of effect

– how many will validate

Generalized linear model

Linear model

Nelder and Wedderburn "Generalized Linear Models" J of the Royal Stat Soc (1972)



McCullagh and Nelder (1989)



Wikipedia

Generalized linear model

Target y is a distribution (exponential family), e.g.:

y in (-∞,∞) : Normal



y in {0,1} : Bernoulli/binomial/logistic regression



y in {0,1,2,...} : Poisson / negative binomial



y in (0,∞) : gamma

Generalized linear model

“Link” function connects target y to the predictor variables X



Poisson : log



Bernoulli : logit

Generalized linear model

The target can then be described as the observation of a probabilistic model: P(y|θ)

If we are trying to find θ, we call this a likelihood: L(θ|y) = P(y|θ)

GLM for RNA-Seq

GLM for RNA-Seq

Log link function means effects are multiplicative

Additive effects could lead to negative expected counts

Design matrix

     X       β      log2(mu)

   1  0             b0
   1  0             b0
   1  0             b0
   1  0  *  b0  =   b0
   1  1     b1      b0 + b1
   1  1             b0 + b1
   1  1             b0 + b1
   1  1             b0 + b1

b0 = 1, b1 = 1

“factorial” / “crossed” design

      X         β      log2(mu)

   1  0  0             b0
   1  0  1             b0 + b2
   1  0  0     b0      b0
   1  0  1  *  b1  =   b0 + b2
   1  1  0     b2      b0 + b1
   1  1  1             b0 + b1 + b2
   1  1  0             b0 + b1
   1  1  1             b0 + b1 + b2

b0 = 1, b1 = 1, b2 = 2
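The product of the design matrix and the coefficient vector can be worked through for the crossed design with b0 = 1, b1 = 1, b2 = 2. This is a plain-Python sketch; the variable names are chosen for the example.

```python
# Design matrix for the crossed design: intercept, first factor, second factor.
X = [
    [1, 0, 0],
    [1, 0, 1],
    [1, 0, 0],
    [1, 0, 1],
    [1, 1, 0],
    [1, 1, 1],
    [1, 1, 0],
    [1, 1, 1],
]
beta = [1, 1, 2]  # b0, b1, b2 as on the slide

# Each row of X picks out which coefficients add up to log2(mu) for a sample.
log2_mu = [sum(x * b for x, b in zip(row, beta)) for row in X]

# Undo the log2 link to get expected counts: additive effects on the log
# scale become multiplicative effects (fold changes) on the count scale.
mu = [2**v for v in log2_mu]
```

Here `log2_mu` is [1, 3, 1, 3, 2, 4, 2, 4] and `mu` is [2, 8, 2, 8, 4, 16, 4, 16], so b1 = 1 corresponds to a doubling and b2 = 2 to a four-fold change.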

factorial and numeric predictors

      X         β      log2(mu)

   1  0  0             b0
   1  1  0             b0 + 1*b1
   1  2  0     b0      b0 + 2*b1
   1  3  0  *  b1  =   b0 + 3*b1
   1  0  1     b2      b0 + b2
   1  1  1             b0 + 1*b1 + b2
   1  2  1             b0 + 2*b1 + b2
   1  3  1             b0 + 3*b1 + b2

b0 = 1, b1 = 1, b2 = 2

Test for interactions

Time course example:



K ~ condition + time + condition:time

[Plot: log counts vs. time]
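What the condition:time term adds can be sketched with made-up coefficients. The example below assumes time is treated as a numeric predictor and condition as a 0/1 indicator; the coefficient values are hypothetical and only meant to show that the interaction changes the slope over time in one condition.

```python
# Hypothetical coefficients on the log2 scale (not from the lecture).
beta = {"intercept": 1.0, "condition": 0.5, "time": 1.0, "condition:time": -0.5}


def log2_mu(condition, time):
    """Linear predictor for K ~ condition + time + condition:time."""
    row = {
        "intercept": 1,
        "condition": condition,
        "time": time,
        # The interaction column is the product of the two main-effect columns.
        "condition:time": condition * time,
    }
    return sum(beta[name] * value for name, value in row.items())


# Expected log2 counts over four time points for each condition.
curves = {c: [log2_mu(c, t) for t in range(4)] for c in (0, 1)}
# Change per unit of time in each condition.
slopes = {c: curves[c][1] - curves[c][0] for c in (0, 1)}
```

The slopes differ by exactly the interaction coefficient, so testing whether it is zero asks whether the time profiles of the two conditions are parallel.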

How to fit a GLM

size factor / normalization factor



dispersion parameter α



log2 fold changes β

Estimate size factors

Sum/mean count is problematic

Robinson and Oshlack: “Trimmed mean of M values” (TMM)

Anders and Huber: Median ratio of sample to pseudoreference

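Size factors: median ratio method

        sample1  sample2  sample3  sample4  pseudoreference  sample1 / pseudoreference
gene1         2        4        4        8                4                       0.50
gene2         1        1        9        9                3                       0.33
gene3         1        9        3        3                3                       0.33
gene4         1        1        4        4                2                       0.50
gene5         4        2        4        8                4                       1.00

s1 = median of the ratios = 0.50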
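The numbers on this slide can be reproduced in a few lines. The sketch below takes the pseudoreference to be the per-gene geometric mean across samples, which matches the values 4, 3, 3, 2, 4 shown; the sample labels are assumed for the example.

```python
import math
from statistics import median

# Counts from the slide: one list of five genes per sample.
counts = {
    "sample1": [2, 1, 1, 1, 4],
    "sample2": [4, 1, 9, 1, 2],
    "sample3": [4, 9, 3, 4, 4],
    "sample4": [8, 9, 3, 4, 8],
}
n_genes = 5
n_samples = len(counts)

# Pseudoreference: geometric mean of each gene across all samples.
# (A gene with a zero count would get a pseudoreference of 0 and would
# have to be left out before taking ratios; not needed for this table.)
pseudoreference = [
    math.prod(col[g] for col in counts.values()) ** (1 / n_samples)
    for g in range(n_genes)
]

# Size factor: median over genes of the sample-to-pseudoreference ratio.
size_factors = {
    name: median(c / ref for c, ref in zip(col, pseudoreference))
    for name, col in counts.items()
}
```

For sample1 the ratios are 0.50, 0.33, 0.33, 0.50, 1.00, giving s1 = 0.50 as on the slide. Using the median means a few very highly expressed genes do not dominate the factor the way they would with the sum of counts.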

Dispersion parameter α

Can be fit independently from β

Method of moments: α = (var - mean) / mean^2



Coefficient of variation = SD/mean



One dimensional optimization of likelihood
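The method-of-moments estimate follows from solving Var(X) = μ + α μ^2 for α. A minimal sketch for one gene within one condition is given below; the counts are invented, and clamping negative estimates to zero is an assumption of this example rather than something stated in the lecture.

```python
from statistics import mean, variance


def dispersion_moments(counts):
    """Method-of-moments dispersion: (var - mean) / mean^2, floored at 0."""
    m = mean(counts)
    v = variance(counts)  # sample variance (n - 1 in the denominator)
    # Assumption: variance below the mean is reported as alpha = 0 (Poisson).
    return max((v - m) / m**2, 0.0)


# Hypothetical replicate counts for one gene.
alpha_overdispersed = dispersion_moments([10, 20, 30])  # mean 20, var 100
alpha_poisson_like = dispersion_moments([9, 10, 11])    # mean 10, var 1
```

For the first gene α = (100 - 20) / 400 = 0.2. For large means the μ term is negligible, so the square root of α is approximately the coefficient of variation.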

log2 fold changes β

Fit using an iterative method

find roots of the derivative of the log likelihood

Multidimensional Newton's method

x_{n+1} = x_n - f(x_n) / f'(x_n)
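A one-dimensional version of this iteration is easy to write down. The sketch below is a deliberately simplified case: a Poisson model with only an intercept and a natural-log link, so f is the derivative of the log likelihood with respect to β and its root is known to be log(mean). The function name, the counts, and the starting value are assumptions of the example.

```python
import math


def fit_log_mean(counts, beta=0.0, tol=1e-10, max_iter=100):
    """Newton's method for the intercept of a Poisson model with log link."""
    n = len(counts)
    total = sum(counts)
    for _ in range(max_iter):
        score = total - n * math.exp(beta)  # f(beta): d/d(beta) of log-lik
        slope = -n * math.exp(beta)         # f'(beta)
        step = score / slope
        beta -= step                        # x_{n+1} = x_n - f(x_n) / f'(x_n)
        if abs(step) < tol:
            break
    return beta


counts = [8, 12, 10, 14]          # hypothetical counts for one gene
beta_hat = fit_log_mean(counts)   # natural-log scale
beta_hat_log2 = beta_hat / math.log(2)
```

The iteration converges to log(11), the log of the sample mean. With several coefficients the same update uses the gradient vector and the matrix of second derivatives in place of f and f'.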

GLM hypothesis testing

Want to test H0 : β_condition = 0

H0 is a nested hypothesis in H_alt

Likelihood ratio test: D = -2 * log ( L_0 / L_alt )

Under H0, D ~ χ^2 with df = df_alt - df_0
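The test statistic can be computed by hand in a small case. The sketch below makes two simplifying assumptions that are not in the lecture: it uses a Poisson likelihood rather than a negative binomial one, and it treats all size factors as equal. The two groups of counts are invented. With one extra parameter in H_alt, df = 1 and the χ^2 tail probability equals erfc(sqrt(D / 2)).

```python
import math


def poisson_loglik(counts, mu):
    """Log likelihood of independent Poisson(mu) counts."""
    return sum(k * math.log(mu) - mu - math.lgamma(k + 1) for k in counts)


# Hypothetical counts for one gene in two conditions.
control = [8, 12, 10, 14]
treated = [20, 25, 18, 21]
pooled = control + treated

# H0: one common mean (condition coefficient is 0).
log_L0 = poisson_loglik(pooled, sum(pooled) / len(pooled))
# H_alt: each condition has its own mean.
log_Lalt = (
    poisson_loglik(control, sum(control) / len(control))
    + poisson_loglik(treated, sum(treated) / len(treated))
)

D = -2 * (log_L0 - log_Lalt)
# Upper tail of the chi-square distribution with 1 degree of freedom.
p_value = math.erfc(math.sqrt(D / 2))
```

Because H0 is nested in H_alt, the alternative model can never fit worse, so D is never negative.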

“MA”-plot

Dispersion plot

Information sharing

Efron and Morris “Stein's Paradox in Statistics” 1977



Add bias, reduce variance



“Shrinkage estimators”, “maximum a posteriori”

Information sharing

“data from one gene are informative about the mean and variance of another gene”

“Analyzing ’omics data using hierarchical models” Hongkai Ji and X Shirley Liu, Nat Biotech 2010

Information sharing

Gene-sample-specific normalization factor / offset

Removing technical variability in RNA-seq data ... Kasper D. Hansen and Rafael A. Irizarry, Biostat 2011

Sensitivity to outliers

“Finding consistent patterns: a nonparametric approach for identifying differential expression in RNA-Seq data” Jun Li and Robert Tibshirani

Gene-level DE software

Non-parametric: SAMseq



Bayesian: baySeq*





Negative binomial: Cuffdiff, DESeq*, DSS*, edgeR*

Transformed data: voom (limma)*

*