Time-dependent gaussian process regression and significance analysis for sparse time-series

Markus Heinonen 1,2†, Olivier Guipaud 3, Fabien Milliat 3, Valérie Buard 3, Béatrice Micheau 3 and Florence d'Alché-Buc 1,2

May 11, 2013

1 Introduction

Gaussian process regression (GPR) has been extensively used for modelling and differential testing of biological time-series measurements due to its robustness and interpretability (Äijö and Lähdesmäki, 2009; Cooke et al., 2011). However, the standard gaussian process (Rasmussen and Williams, 2006) assumes stationary model dynamics and is a poor fit for common perturbation experiments, where we expect to see rapid changes after the perturbation and a diminishing rate of state change as the cell returns to a stable state. A common application of time-series measurements is testing for a significant difference between two time-series profiles (Dudoit et al., 2002). The currently used two-sample differential tests based on gaussian processes focus on comparing model likelihoods over a subset of measured time-points (Stegle et al., 2010; Storey et al., 2005), and hence require dense measurements to cover the time axis.

We address these problems by proposing time-dependent extensions to both gaussian process regression and significance analysis between time-series. We propose a time-dependent noise model and time-dependent covariance priors (Gibbs, 1997; Paciorek and Schervish, 2004) suitable for perturbation experiments. We utilise a novel model inference criterion for sparse measurements, which results in more informative models along time. We propose two novel differential tests for time-series that both allow significance testing at non-observed time-points. We apply the extended GPR model to the analysis of differential expression in a transcriptomics dataset of irradiated human umbilical vein endothelial cells (HUVEC).

2 Time-dependent GPR

Let $y = (y_{t_1}, \dots, y_{t_N}) \in \mathbb{R}^N$ be the vector of $N$ noisy output signal measurements $y_t \in \mathbb{R}$ at input time-points $T = (t_1, \dots, t_N) \in \mathbb{R}_+^N$. In gene expression measurements the values $y_t$ correspond to expression levels of a particular gene at time $t$. In regression modelling we assume that a true model $f(t)$ explains the observations through

$$y_t = f(t) + \varepsilon_t$$

under a Gaussian time-dependent noise model $\varepsilon_t \sim \mathcal{N}(0, \sigma_t^2)$. We collect the time-dependent noise variances $\sigma_{t_1}^2, \dots, \sigma_{t_N}^2$ into a diagonal covariance matrix $\Omega$, and hence $(\varepsilon_{t_1}, \dots, \varepsilon_{t_N}) \sim \mathcal{N}(0, \Omega)$.

1 IBISC, Université d'Évry Val d'Essonne, France
2 LRI, INRIA-Saclay, Université Paris Sud, France
3 LRTE, Institut de Radioprotection et de Sûreté Nucléaire (IRSN), France
† [email protected]

Gaussian process regression is a bayesian non-parametric and non-linear method for estimating the underlying model of such a system. A gaussian process is a generalisation of distributions to functions, where any subset of function evaluations is jointly gaussian (Rasmussen and Williams, 2006). A gaussian process represents a model $f_\star = (f(t_1), \dots, f(t_{N_\star}))$ at time-points $T_\star = (t_1, \dots, t_{N_\star})$ as the posterior of the estimated true model given data, $f_\star \mid y \sim \mathcal{N}(\mu_\star, \Sigma_\star)$, defined by

$$\mu_\star = K_{\star T} (K_{TT} + \Omega)^{-1} y,$$

$$\Sigma_\star = K_{\star\star} - K_{\star T} (K_{TT} + \Omega)^{-1} K_{T\star}.$$

We denote this distribution as the predictive distribution, which gives the posterior of the model at arbitrary target time-points $T_\star$. The posterior of the true model is often visualised by sampling (dense) vectors $\hat{f}$ directly from the predictive distribution, or by visualising the mean model $\mu_\star$ along with e.g. 95% confidence intervals $\pm 2\sqrt{\operatorname{diag} \Sigma_\star}$. However, if we are interested in sampling from the estimated model with observational noise $\Omega$, we use the distribution $y_\star \mid f_\star \sim \mathcal{N}(\mu_\star, \Sigma_\star + \Omega)$ (Kirk and Stumpf, 2009).

The model prior $K_{TT}$ over time-points $T \times T$ (resp. $T_\star$) determines the model dynamics. We utilise a time-dependent RBF kernel prior

$$K(t, t') = \sigma_f^2 \exp\left( -\left( \frac{t}{\ell(t)} - \frac{t'}{\ell(t')} \right)^2 \right),$$

where each time $t$ is associated with a length-scale (smoothness) function $\ell(t)$ following an exponential decay

$$\ell(t) = \ell - (\ell - \ell_{\min}) e^{-ct},$$

controlled by three hyperparameters: $\ell$ is the maximum length-scale, $\ell_{\min}$ is the minimum length-scale (at time $t = 0$), and the curvature $c$ controls how fast the function $\ell(t)$ approaches its maximum value. The time-dependent kernel prior encodes our assumption that the model should grow smoother along time.
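The predictive equations and the time-dependent kernel above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the function names, the toy data, the noise variances and the hyperparameter values are all assumptions made for the example. Note that the kernel is an ordinary RBF kernel applied to the warped time $u(t) = t / \ell(t)$, which is why it remains a valid covariance function.

```python
import numpy as np

def lengthscale(t, ell, ell_min, c):
    """Length-scale l(t) = ell - (ell - ell_min) * exp(-c * t)."""
    t = np.asarray(t, dtype=float)
    return ell - (ell - ell_min) * np.exp(-c * t)

def td_rbf_kernel(t1, t2, sigma_f, ell, ell_min, c):
    """K(t, t') = sigma_f^2 * exp(-(t / l(t) - t' / l(t'))^2)."""
    u1 = np.asarray(t1, dtype=float) / lengthscale(t1, ell, ell_min, c)
    u2 = np.asarray(t2, dtype=float) / lengthscale(t2, ell, ell_min, c)
    return sigma_f**2 * np.exp(-(u1[:, None] - u2[None, :]) ** 2)

def gp_posterior(T, y, noise_var, T_star, **hyp):
    """Return (mu_star, Sigma_star) of the predictive distribution f_star | y."""
    K_TT = td_rbf_kernel(T, T, **hyp) + np.diag(noise_var)  # K_TT + Omega
    K_sT = td_rbf_kernel(T_star, T, **hyp)
    K_ss = td_rbf_kernel(T_star, T_star, **hyp)
    mu_star = K_sT @ np.linalg.solve(K_TT, y)
    Sigma_star = K_ss - K_sT @ np.linalg.solve(K_TT, K_sT.T)
    return mu_star, Sigma_star

# Toy example (all values below are made up for illustration).
T = np.array([0.5, 1, 2, 3, 4, 7, 14, 21])          # measurement days
y = np.exp(-T / 3.0)                                 # synthetic signal
noise_var = 0.05 + 0.2 * np.exp(-T)                  # hypothetical sigma_t^2
hyp = dict(sigma_f=1.0, ell=5.0, ell_min=0.5, c=0.3)

T_star = np.linspace(0.5, 21, 50)                    # dense target time-points
mu_star, Sigma_star = gp_posterior(T, y, noise_var, T_star, **hyp)

# Half-width of the +/- 2 sqrt(diag Sigma_star) band around the mean
band = 2 * np.sqrt(np.clip(np.diag(Sigma_star), 0, None))
```

Using `np.linalg.solve` instead of an explicit inverse is a common numerical choice; a Cholesky factorisation would serve equally well.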

2.1 Model inference

The GPR model is implicitly defined by the kernel prior and the noise, $K_{TT} + \Omega$, which are in turn generated by the hyperparameters $\theta = (\sigma_f^2, \Omega, \ell, \ell_{\min}, c)$. A standard approach for model inference of GPs is to choose hyperparameters that maximise the marginal (log) likelihood (MLL)

$$p(y \mid T, \theta) \sim \mathcal{N}(0, K_{TT} + \Omega).$$

The MLL optimisation criterion naturally pursues a model with good fit and low complexity by regularising the determinant of the prior. However, the data fit is only evaluated at the observed time-points. In case of uneven observation intervals or sparse data, the model tends towards maximally uninformative estimates for time intervals not constrained by data.

We propose evaluating the MLL at arbitrary time-points $T_\star$ by drawing noisy samples $\{\hat{y}_i\}$ from the noisy posterior distribution $\mathcal{N}(\mu_\star, \Sigma_\star + \Omega)$. The samples converge into an expected marginal (log) likelihood (EMLL)

$$\mathbb{E}_{\hat{y} \sim y_\star}\left[ p(\hat{y} \mid T_\star, \theta) \right] \sim \mathcal{N}(\mu_\star, \Sigma_\star + K_{\star\star} + 2\Omega).$$

The expected optimisation criterion uses the learned model with noise $\Sigma_\star + \Omega$ as another prior in addition to the prior $K_{\star\star} + \Omega$. Hence, we use the target time-points to place additional importance on the non-observed time intervals.
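One way to read the EMLL expression is as a Gaussian convolution: averaging the prior density $\mathcal{N}(\hat{y}; 0, K_{\star\star} + \Omega)$ over samples $\hat{y} \sim \mathcal{N}(\mu_\star, \Sigma_\star + \Omega)$ gives $\mathcal{N}(\mu_\star; 0, \Sigma_\star + K_{\star\star} + 2\Omega)$, which has the covariance stated above. The sketch below checks this reading by Monte Carlo. It is an illustration under assumptions: the function names and toy matrices are hypothetical, and `Omega_star` stands for a noise covariance at the target time-points, which the text does not specify.

```python
import numpy as np

def log_gaussian(Y, mean, C):
    """Log density of N(mean, C) at each row of Y (Y may be a single vector)."""
    R = np.atleast_2d(Y) - mean
    L = np.linalg.cholesky(C)
    z = np.linalg.solve(L, R.T)                      # whitened residuals, (d, n)
    log_det = 2.0 * np.sum(np.log(np.diag(L)))
    d = C.shape[0]
    return -0.5 * (np.sum(z**2, axis=0) + log_det + d * np.log(2 * np.pi))

def mll(y, K_TT, Omega):
    """Marginal log likelihood log N(y; 0, K_TT + Omega)."""
    return log_gaussian(y, 0.0, K_TT + Omega)[0]

def emll_monte_carlo(mu_star, Sigma_star, K_ss, Omega_star, n_samples=50000, seed=0):
    """Log of the average prior density over noisy posterior samples."""
    rng = np.random.default_rng(seed)
    Y_hat = rng.multivariate_normal(mu_star, Sigma_star + Omega_star, size=n_samples)
    return np.log(np.mean(np.exp(log_gaussian(Y_hat, 0.0, K_ss + Omega_star))))

def emll_closed_form(mu_star, Sigma_star, K_ss, Omega_star):
    """log N(mu_star; 0, Sigma_star + K_ss + 2 Omega_star)."""
    return log_gaussian(mu_star, 0.0, Sigma_star + K_ss + 2 * Omega_star)[0]

# Toy inputs at three target time-points (made-up values).
t = np.array([0.0, 1.0, 2.0])
K_ss = np.exp(-(t[:, None] - t[None, :]) ** 2)
Omega_star = 0.2 * np.eye(3)
Sigma_star = 0.1 * K_ss
mu_star = np.array([0.5, -0.2, 0.1])

emll_mc = emll_monte_carlo(mu_star, Sigma_star, K_ss, Omega_star)
emll_cf = emll_closed_form(mu_star, Sigma_star, K_ss, Omega_star)
```

In an optimisation loop the closed form would normally be preferred, since it is deterministic and differentiable in the hyperparameters; the sampling version is shown only to connect it with the description in the text.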

3 Detecting time intervals of differential expression

Next, we concern ourselves with determining whether two time-series $y_A$ and $y_B$ differ significantly at some time-points $\{t_\star\}$. The time-series might correspond to a cell measured at conditions A and B (e.g. case and control), or to time-series of two different genes. In both cases the analysis proceeds analogously. A standard approach is to compare the likelihood ratios (Storey et al., 2005) or bayes factors (Stegle et al., 2010) between the GPR model $H_S$ under the null hypothesis (S), where the data $y_S = (y_A, y_B)$ is pooled, against the two individual GPR models $H_A$ and $H_B$ for case and control. Both approaches suffer from the fact that differential expression can be compared only over observed time-points.

Model                           Case            Control         Independent     Shared
Static noise, static kernel     −3.81 ± 12.2    −4.49 ± 15.5    −4.15 ± 12.4    −9.29 ± 23.1
Static noise, dynamic kernel    −2.58 ± 12.1    −2.32 ± 10.1    −2.45 ± 11.5    −5.70 ± 22.7
Dynamic noise, static kernel    −0.12 ± 15.8    4.66 ± 11.9     2.27 ± 13.8     2.60 ± 22.1
Dynamic noise, dynamic kernel   7.19 ± 11.0     6.93 ± 11.0     7.06 ± 11.0     4.98 ± 22.1
Expected model∗                 −6.96 ± 22.2    −7.67 ± 26.4    −7.31 ± 24.3    −8.62 ± 25.4

Table 1: The averages and standard deviations of optimal marginal log likelihoods. ∗ The expected model is time-dependent and optimises over the EMLL criterion. It is not directly comparable to the other rows.

We propose two novel likelihood ratios for arbitrary time-points $T_\star$: (i) the expected likelihood ratio and (ii) the posterior concentration ratio. In the expected likelihood ratio we evaluate the likelihood ratio on a sample from the noisy posterior distribution $y_\star$, which converges into a ratio between expected likelihoods

$$\mathbb{E}_{\hat{y} \sim y_\star}\, p(\hat{y} \mid T_\star, \theta) \sim \mathcal{N}(\mu_\star, \Sigma_\star + K_{\star\star} + 2\Omega).$$

We effectively replace the observed data $y$ with the expected data $\mathbb{E}\, y = \mu_\star$ given by the GPR model, which we can generate at arbitrary time-points.

Alternatively, we propose the posterior concentration test, which evaluates the ratio between the concentrations (inverse of variance, i.e. confidence in the model) of the model posteriors under the shared and independent hypotheses. The posterior concentration follows

$$\mathbb{E}_{\hat{y} \sim y_\star}\, p(\hat{y} \mid T_\star, f_\star, \theta) \sim \mathcal{N}(0, \Sigma_\star + \Omega).$$

The posterior concentration evaluates the confidence in the resulting models under the two hypotheses.
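As a point of reference for the tests of this section, the standard pooled-versus-independent comparison can be sketched as a log marginal likelihood ratio, $\log p(y_A \mid H_A) + \log p(y_B \mid H_B) - \log p(y_S \mid H_S)$. The sketch illustrates only this baseline over observed time-points, not the two proposed ratios. It makes several simplifying assumptions that are not from the text: a stationary RBF kernel, a single static noise variance, hyperparameters held fixed rather than optimised per hypothesis, and synthetic data.

```python
import numpy as np

def rbf(t1, t2, sigma_f=1.0, ell=3.0):
    """Stationary RBF kernel (a stand-in for the time-dependent kernel)."""
    d = np.asarray(t1, dtype=float)[:, None] - np.asarray(t2, dtype=float)[None, :]
    return sigma_f**2 * np.exp(-(d / ell) ** 2)

def mll(T, y, noise_var=0.1):
    """Marginal log likelihood log N(y; 0, K_TT + noise_var * I)."""
    C = rbf(T, T) + noise_var * np.eye(len(T))
    L = np.linalg.cholesky(C)
    z = np.linalg.solve(L, y)
    return -0.5 * (z @ z) - np.sum(np.log(np.diag(L))) - 0.5 * len(T) * np.log(2 * np.pi)

def log_ratio(T, y_A, y_B):
    """Independent models H_A, H_B against the shared model H_S on pooled data."""
    T_S = np.concatenate([T, T])
    y_S = np.concatenate([y_A, y_B])
    return mll(T, y_A) + mll(T, y_B) - mll(T_S, y_S)

# Synthetic example: identical profiles versus a shifted profile.
T = np.array([0.5, 1, 2, 3, 4, 7, 14, 21])
y_A = np.sin(T / 4.0)

ratio_same = log_ratio(T, y_A, y_A)          # no difference between conditions
ratio_diff = log_ratio(T, y_A, y_A + 3.0)    # clear difference between conditions
```

A positive log ratio favours the independent models, i.e. differential expression. On this toy data the shifted pair yields a much larger ratio than the identical pair, because the pooled model has to explain two conflicting values at every time-point through the noise term alone.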

4 Results

We conduct exploratory experiments on an in-house dataset of human umbilical vein endothelial cells (HUVEC) under irradiation. There are a total of N = 283 genes with transcriptomics profiles measured with qPCR under control and under a single irradiation dose of 2 Gy (case) at 0 hours, with measurements at days (0.5, 1, 2, 3, 4, 7, 14, 21). We employ three biological replicate populations, separated from a single cell population just prior to experiments. Hence, in total there are 2 conditions measured in 3 replicates at 8 time-points for 283 genes. By visual analysis, a majority of the genes do not seem to exhibit differential expression under irradiation. We optimise the GPR models using gradient ascent with restarts.

4.1 Model inference performance

Table 1 summarises the MLL fits achieved by both stationary and time-dependent GPR models. Time-dependent GPR expands the space of possible prior kernels, and hence the values are comparable due to the same optimisation criterion. The benefit of both time-dependent noise models and time-dependent kernels is evident, with a total log-scale increase in MLL values of 11.2 and 14.3 for the independent (I) and shared (S) hypotheses, respectively. The expected model maximises the EMLL criterion and is hence not comparable. It tends to produce models with less emphasis on the model complexity term, and hence higher data fit. In the remaining experiments, we utilise the time-dependent GPR.

The model fits of the independent and shared models hint that, on average, the independent model is more probable by approximately 2 units on the log scale.

4.2 Detecting differential intervals

While the dynamic GPR model with higher MLL provides a more accurate model of the data, we desire to evaluate how the higher MLLs translate into more accurately separating the time-series under irradiation and control. Figure 1 shows the number of genes with a log ratio above one along the time-axis with the classic MLL, expected MLL, posterior concentration (PC), and noisy PC ratios (see Section 3).

[Figure 1 plot: "Differential expression by time"; x-axis: Time (h); y-axis: Genes w/ log ratio > 1; series: Expected MLL ratio, MLL ratio, Noisy posterior concentration ratio, Noisy posterior concentration ratio (expected model), Posterior concentration ratio]

Figure 1: The distribution of counts of genes with time-points with a log ratio ≥ 1.

While the classic MLL criterion finds the most differential genes, it provides only point estimates. The EMLL criterion achieves the highest separability, and finds higher differential expression between observations (e.g. between 7 and 14 days). Figure 1 indicates that only 25 genes are immediately perturbed by irradiation, with the number increasing during the first week. The effects are highest during the second week (75 genes) and decrease rapidly afterwards. Remarkably, almost no genes show differential expression after 3 weeks from the radiation dose.

Table 2 summarises the results from the different tests. The EMLL test finds a total of 175 genes with differential expression on at least one time-point. According to EMLL, an average gene is differentially expressed for 96 hours (data not shown). The high separability of the classic MLL hints at some false positives, especially at the 3 week time-point.

Ratio              Avg diff. time-points   AUC      Diff. genes
Expected MLL       22.1                    0.0783   175
NPC (exp. model)   18.8                    0.0665   142
NPC                12.4                    0.0439   140
PC                 2.5                     0.0087   72
MLL                2.7                     0.0094   234

Table 2: Timepoints with differential expression.

5 Discussion and conclusions

In this paper we have proposed time-dependent extensions to GPR for sparse biological time-series and to two-sample differential testing. The time-dependent GPR paired with the novel likelihood ratio tests provides a robust differential analysis method between irradiated and control samples. We leave the biological analysis and further interpretation as future work.

References

T. Äijö and H. Lähdesmäki. Learning gene regulatory networks from gene expression measurements using non-parametric molecular kinetics. Bioinformatics, 25(22):2937–2944, 2009.

Emma J. Cooke, Richard S. Savage, Paul D. W. Kirk, R. Darkins, and David L. Wild. Bayesian hierarchical clustering for microarray time series data with replicates and outlier measurements. BMC Bioinformatics, 12:399, 2011.

Sandrine Dudoit, Yee Hwa Yang, Matthew J Callow, and Terence P Speed. Statistical methods for identifying differentially expressed genes in replicated cDNA microarray experiments. Statistica Sinica, 12(1):111–140, 2002.

M. N. Gibbs. Bayesian Gaussian Processes for Regression and Classification. PhD thesis, University of Cambridge, 1997.

Paul D. W. Kirk and Michael P. H. Stumpf. Gaussian process regression bootstrapping: exploring the effects of uncertainty in time course data. Bioinformatics, 25(10):1300–1306, 2009.

C. Paciorek and M. J. Schervish. Nonstationary covariance functions for gaussian process regression. In NIPS, volume 16. MIT Press, 2004.

C. E. Rasmussen and C. K. I. Williams. Gaussian processes for machine learning. MIT Press, 2006.

Oliver Stegle, Katherine J. Denby, Emma J. Cooke, David L. Wild, Zoubin Ghahramani, and Karsten M. Borgwardt. A robust bayesian two-sample test for detecting intervals of differential gene expression in microarray time series. J Comp Biol, 17(3):355–367, 2010.

John D. Storey, Wenzhong Xiao, Jeffrey T. Leek, Ronald G. Tompkins, and Ronald W. Davis. Significance analysis of time course microarray experiments. PNAS, 102(36):12837–12842, 2005.