Detecting DNA modifications from SMRT sequencing data by modeling sequence context dependence of polymerase kinetic

Zhixing Feng 1,4, Gang Fang 2, Jonas Korlach 3, Tyson Clark 3, Khai Luong 3, Xuegong Zhang 1, Wing Wong 4,*, and Eric Schadt 3,5,*

1 Tsinghua National Laboratory for Information Science and Technology, and Department of Automation, Tsinghua University, Beijing 100084, China; 2 Department of Computer Science and Engineering, University of Minnesota, Minneapolis, MN 55455; 3 Pacific Biosciences, 1380 Willow Road, Menlo Park, CA 94025; 4 Department of Statistics, Stanford University, Stanford, CA 94305; 5 Department of Genetics and Genomics Sciences, Mount Sinai School of Medicine, New York, NY 10029

* Correspondence: [email protected], [email protected]
Text S1: Hyperparameter estimation for the alternative model

In this section, we explain how to estimate the hyperparameters assuming a control sample is available. Since the hierarchical model without control data is essentially a special case of the hierarchical model with control data, one can simply remove $y_0$, $\mu_0$ and $\sigma_0^2$ to obtain the hyperparameter estimates when a control sample is not available; the algorithm is otherwise unchanged. When the alternative model is true (see sections Hierarchical model with control data and Hierarchical model without control data in the main text), the posterior distributions of $\mu_i$ and $\sigma_i^2$, $i = 0, 1, ..., m$, are [1]
\[
p(\mu_i \mid y_i, \sigma_i^2, \theta, \kappa) = N\!\left(\frac{\kappa}{\kappa + n_i}\theta + \frac{n_i}{\kappa + n_i}\bar{y}_i,\; \frac{\sigma_i^2}{\kappa + n_i}\right)
\]
\[
p(\sigma_i^2 \mid y_i, \upsilon, \tau^2) = \text{scaled inverse-}\chi^2(\upsilon + n_i, \tilde{\sigma}_i^2)
\]
where
\[
\tilde{\sigma}_i^2 = \frac{1}{\upsilon + n_i}\left[\upsilon\tau^2 + (n_i - 1)s_i^2 + \frac{\kappa n_i}{\kappa + n_i}(\bar{y}_i - \theta)^2\right] \tag{1}
\]
Here $\bar{y}_i = \frac{1}{n_i}\sum_{j=1}^{n_i} y_{ij}$ and $s_i^2 = \frac{1}{n_i - 1}\sum_{j=1}^{n_i}(y_{ij} - \bar{y}_i)^2$. The posterior distribution of $\sigma_c^2$ is [1]
\[
p(\sigma_c^2 \mid y_c, \upsilon, \tau^2) = \text{scaled inverse-}\chi^2(\upsilon + n_c, \tilde{\sigma}_c^2)
\]
where
\[
\tilde{\sigma}_c^2 = \frac{1}{\upsilon + n_c}\left[\upsilon\tau^2 + (n_c - 1)s_c^2\right]
\]
We used the posterior expectations of $\mu_i$ and $\sigma_i^2$ as their estimates, which are
\[
\hat{\mu}_i = E(\mu_i \mid y_i, \theta, \kappa) = \frac{\kappa}{\kappa + n_i}\theta + \frac{n_i}{\kappa + n_i}\bar{y}_i, \quad i = 0, 1, ..., m, \tag{2}
\]
\[
\hat{\sigma}_i^2 = E(\sigma_i^2 \mid y_i, \upsilon, \tau^2) = \frac{n_i + \upsilon}{n_i + \upsilon - 2}\tilde{\sigma}_i^2, \quad i = c, 0, 1, ..., m.
\]
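As a concrete illustration, the following Python sketch evaluates equations (1) and (2) for a single group of observations; the function name and the NumPy-based interface are our own choices and are not part of the original implementation.

import numpy as np

def posterior_estimates(y_i, theta, kappa, upsilon, tau2):
    # Posterior point estimates for one group y_i = (y_i1, ..., y_i n_i)
    # of Box-Cox transformed IPDs, given the current hyperparameters.
    n_i = len(y_i)
    ybar_i = np.mean(y_i)
    s2_i = np.var(y_i, ddof=1) if n_i > 1 else 0.0  # sample variance s_i^2
    # Scale of the scaled inverse-chi^2 posterior, equation (1)
    sigma2_tilde = (upsilon * tau2 + (n_i - 1) * s2_i
                    + kappa * n_i / (kappa + n_i) * (ybar_i - theta) ** 2) / (upsilon + n_i)
    # Posterior expectations, equation (2)
    mu_hat = kappa / (kappa + n_i) * theta + n_i / (kappa + n_i) * ybar_i
    sigma2_hat = (n_i + upsilon) / (n_i + upsilon - 2.0) * sigma2_tilde
    return mu_hat, sigma2_hat, sigma2_tilde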
We estimate the hyperparameters $(\theta, \kappa, \upsilon, \tau^2, \mu_c)$ from the data by maximizing the marginal log-likelihood function, which is
\[
\begin{aligned}
L(y_c, y_0, y_1, ..., y_m; \theta, \kappa, \upsilon, \tau^2, \mu_c) &= \log p(y_c, y_0, y_1, ..., y_m \mid \theta, \kappa, \upsilon, \tau^2, \mu_c) \\
&= \log p(y_c \mid \theta, \kappa, \upsilon, \tau^2, \mu_c) + \sum_{i=0}^{m} \log p(y_i \mid \theta, \kappa, \upsilon, \tau^2)
\end{aligned}
\]
where
\[
\begin{aligned}
\log p(y_c \mid \theta, \kappa, \upsilon, \tau^2, \mu_c) = {}& \frac{\upsilon}{2}\log(\upsilon\tau^2) + \log\Gamma\!\left(\frac{\upsilon + n_c}{2}\right) - \frac{\upsilon + n_c}{2}\log\!\left(\sum_{j=1}^{n_c}(y_{cj} - \mu_c)^2 + \upsilon\tau^2\right) \\
& - \log\Gamma\!\left(\frac{\upsilon}{2}\right) - \frac{n_c}{2}\log(\pi)
\end{aligned}
\]
\[
\begin{aligned}
\sum_{i=0}^{m}\log p(y_i \mid \theta, \kappa, \upsilon, \tau^2) = \sum_{i=0}^{m}\Big[& \frac{1}{2}\log(\kappa) + \log\Gamma\!\left(\frac{\upsilon + n_i}{2}\right) + \frac{\upsilon}{2}\log(\upsilon\tau^2) - \frac{1}{2}\log(\kappa + n_i) \\
& - \log\Gamma\!\left(\frac{\upsilon}{2}\right) - \frac{\upsilon + n_i}{2}\log\!\left((\upsilon + n_i)\tilde{\sigma}_i^2\right) - \frac{n_i}{2}\log(\pi)\Big]
\end{aligned}
\]
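For reference, a minimal Python sketch of these two marginal log-likelihood terms, using scipy.special.gammaln for $\log\Gamma(\cdot)$; sigma2_tilde denotes $\tilde{\sigma}_i^2$ from equation (1), and the function names and interfaces are our own.

import numpy as np
from scipy.special import gammaln

def log_marginal_control(y_c, mu_c, upsilon, tau2):
    # log p(y_c | theta, kappa, upsilon, tau2, mu_c) for the control data
    n_c = len(y_c)
    ss = np.sum((np.asarray(y_c) - mu_c) ** 2)
    return (upsilon / 2.0 * np.log(upsilon * tau2)
            + gammaln((upsilon + n_c) / 2.0)
            - (upsilon + n_c) / 2.0 * np.log(ss + upsilon * tau2)
            - gammaln(upsilon / 2.0)
            - n_c / 2.0 * np.log(np.pi))

def log_marginal_native(n_i, kappa, upsilon, tau2, sigma2_tilde):
    # One term of the sum over i = 0, ..., m: log p(y_i | theta, kappa, upsilon, tau2),
    # expressed through n_i and sigma2_tilde from equation (1)
    return (0.5 * np.log(kappa) + gammaln((upsilon + n_i) / 2.0)
            + upsilon / 2.0 * np.log(upsilon * tau2)
            - 0.5 * np.log(kappa + n_i) - gammaln(upsilon / 2.0)
            - (upsilon + n_i) / 2.0 * np.log((upsilon + n_i) * sigma2_tilde)
            - n_i / 2.0 * np.log(np.pi))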
It is obvious that $L(y_c, y_0, y_1, ..., y_m; \theta, \kappa, \upsilon, \tau^2, \mu_c)$ is maximized by setting $\mu_c = \frac{1}{n_c}\sum_{j=1}^{n_c} y_{cj}$. However, it is difficult to obtain a closed form for $(\theta, \kappa, \upsilon, \tau^2)$, and we therefore adopted an EM algorithm (Algorithm 1) to estimate them numerically. In the EM procedure, $\mu_i$ and $\sigma_i^2$ were regarded as missing data, and the log-likelihood function of the complete data was
\[
\begin{aligned}
l(y, \mu, \sigma^2; \theta, \kappa, \tau^2, \upsilon) &= \log p(y \mid \mu, \sigma^2) + \log p(\mu \mid \sigma^2, \theta, \kappa) + \log p(\sigma^2 \mid \upsilon, \tau^2) \\
&= \log p(y \mid \mu, \sigma^2) + \sum_{i=0}^{m}\log p(\mu_i \mid \sigma_i^2, \theta, \kappa) + \sum_{i=c,0,1,...,m}\log p(\sigma_i^2 \mid \tau^2, \upsilon)
\end{aligned}
\tag{3}
\]
where $y = (y_c, y_0, y_1, ..., y_m)$, $\mu = (\mu_0, \mu_1, ..., \mu_m)$ and $\sigma^2 = (\sigma_c^2, \sigma_0^2, \sigma_1^2, ..., \sigma_m^2)$.
Initial values $(\theta_0, \kappa_0, \tau_0^2, \upsilon_0)$ were assigned to $(\theta, \kappa, \tau^2, \upsilon)$, and in the $t$-th step ($t \geq 1$) of the EM algorithm, $(\theta, \kappa, \tau^2, \upsilon)$ were updated with the optimal values $(\theta_{opt}, \kappa_{opt}, \tau_{opt}^2, \upsilon_{opt})$ maximizing $E\left[l(y, \mu, \sigma^2; \theta, \kappa, \tau^2, \upsilon) \mid y, \theta_{t-1}, \kappa_{t-1}, \tau_{t-1}^2, \upsilon_{t-1}\right]$, where $E(\cdot)$ is the expectation with respect to $(\mu, \sigma^2)$; that is, $\theta_t = \theta_{opt}$, $\kappa_t = \kappa_{opt}$, $\tau_t^2 = \tau_{opt}^2$, $\upsilon_t = \upsilon_{opt}$. This procedure was repeated until convergence.

In the $t$-th step of the EM algorithm, $(\theta_{opt}, \kappa_{opt})$ and $(\tau_{opt}^2, \upsilon_{opt})$ can be obtained by maximizing the posterior expectations of the second and third terms of equation (3), i.e. $\sum_{i=0}^{m} E\left[\log p(\mu_i \mid \sigma_i^2, \theta, \kappa) \mid y, \theta_{t-1}, \kappa_{t-1}, \tau_{t-1}^2, \upsilon_{t-1}\right]$ and $\sum_{i=c,0,1,...,m} E\left[\log p(\sigma_i^2 \mid \tau^2, \upsilon) \mid y, \theta_{t-1}, \kappa_{t-1}, \tau_{t-1}^2, \upsilon_{t-1}\right]$, respectively. Convergence is monitored through the improvement in the expected complete-data log-likelihood,
\[
\Delta l = E\left[l(y, \mu, \sigma^2; \theta_{opt}, \kappa_{opt}, \tau_{opt}^2, \upsilon_{opt}) \mid y, \theta_{t-1}, \kappa_{t-1}, \tau_{t-1}^2, \upsilon_{t-1}\right] - E\left[l(y, \mu, \sigma^2; \theta_{t-1}, \kappa_{t-1}, \tau_{t-1}^2, \upsilon_{t-1}) \mid y, \theta_{t-1}, \kappa_{t-1}, \tau_{t-1}^2, \upsilon_{t-1}\right]
\]
Algorithm 1: EM algorithm for hyperparameter estimation

Assign initial values to the hyperparameters, $\theta = \theta_0$, $\kappa = \kappa_0$, $\upsilon = \upsilon_0$ and $\tau^2 = \tau_0^2$. Set $t = 1$.
while $\Delta l > 0.1$ do
  E-step: Calculate the conditional expectation of the log-likelihood function with respect to $\mu_i$ and $\sigma_i^2$, i.e. $E\left[l(y, \mu, \sigma^2; \theta, \kappa, \tau^2, \upsilon) \mid y, \theta_{t-1}, \kappa_{t-1}, \tau_{t-1}^2, \upsilon_{t-1}\right]$.
  M-step: Update $(\theta, \kappa, \upsilon, \tau^2)$ by setting $\theta_t = \theta_{opt}$, $\kappa_t = \kappa_{opt}$, $\upsilon_t = \upsilon_{opt}$, $\tau_t^2 = \tau_{opt}^2$, where $(\theta_{opt}, \kappa_{opt}, \upsilon_{opt}, \tau_{opt}^2)$ are the hyperparameters maximizing $E\left[l(y, \mu, \sigma^2; \theta, \kappa, \tau^2, \upsilon) \mid y, \theta_{t-1}, \kappa_{t-1}, \tau_{t-1}^2, \upsilon_{t-1}\right]$.
  Set $t = t + 1$.
end while

Estimating $\theta$ and $\kappa$

By taking the posterior expectation of the second term of equation (3), we get
\[
\begin{aligned}
& \sum_{i=0}^{m} E\left[\log p(\mu_i \mid \sigma_i^2, \theta, \kappa) \mid y, \theta_{t-1}, \kappa_{t-1}, \tau_{t-1}^2, \upsilon_{t-1}\right] \\
& = -\frac{\kappa}{2}\sum_{i=0}^{m}\left[\frac{(\theta - \hat{\mu}_{i(t-1)})^2}{\tilde{\sigma}_{i(t-1)}^2} + \frac{1}{\kappa_{t-1} + n_i}\right] + \frac{m+1}{2}\log(\kappa) + C
\end{aligned}
\]
where $\tilde{\sigma}_{i(t-1)}^2$ and $\hat{\mu}_{i(t-1)}$ are the estimates of $\tilde{\sigma}_i^2$ and $\mu_i$ given $(\theta_{t-1}, \kappa_{t-1}, \tau_{t-1}^2, \upsilon_{t-1})$ (equations (1) and (2)), and $C$ is a constant that does not involve any hyperparameters.
$\sum_{i=0}^{m} E\left[\log p(\mu_i \mid \sigma_i^2, \theta, \kappa) \mid y, \theta_{t-1}, \kappa_{t-1}, \tau_{t-1}^2, \upsilon_{t-1}\right]$ can be maximized by $\theta_{opt}$ and $\kappa_{opt}$, which are
\[
\theta_{opt} = \sum_{i=0}^{m}\frac{\hat{\mu}_{i(t-1)}}{\tilde{\sigma}_{i(t-1)}^2} \Big/ \sum_{i=0}^{m}\frac{1}{\tilde{\sigma}_{i(t-1)}^2}
\]
\[
\kappa_{opt} = (m+1) \Big/ \sum_{i=0}^{m}\left[\frac{(\theta_{opt} - \hat{\mu}_{i(t-1)})^2}{\tilde{\sigma}_{i(t-1)}^2} + \frac{1}{\kappa_{t-1} + n_i}\right]
\]
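A hedged Python sketch of this M-step update, with mu_hat, sigma2_tilde and n holding the group-wise quantities $\hat{\mu}_{i(t-1)}$, $\tilde{\sigma}_{i(t-1)}^2$ and $n_i$ for $i = 0, ..., m$, and kappa_prev holding $\kappa_{t-1}$ (the names are ours):

import numpy as np

def m_step_theta_kappa(mu_hat, sigma2_tilde, n, kappa_prev):
    # Closed-form update of theta and kappa; arrays are indexed by i = 0, ..., m
    theta_opt = np.sum(mu_hat / sigma2_tilde) / np.sum(1.0 / sigma2_tilde)
    denom = np.sum((theta_opt - mu_hat) ** 2 / sigma2_tilde + 1.0 / (kappa_prev + n))
    kappa_opt = len(mu_hat) / denom  # len(mu_hat) = m + 1 groups
    return theta_opt, kappa_opt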
Estimating $\tau^2$ and $\upsilon$
By taking the posterior expectation of the third term of equation (3), we get
\[
\begin{aligned}
& \sum_{i=c,0,1,...,m} E\left[\log p(\sigma_i^2 \mid \tau^2, \upsilon) \mid y, \theta_{t-1}, \kappa_{t-1}, \tau_{t-1}^2, \upsilon_{t-1}\right] \\
& = \frac{(m+2)\upsilon}{2}\log\!\left(\frac{\upsilon\tau^2}{2}\right) - \left(\frac{\upsilon}{2}+1\right)\sum_{i=c,0,1,...,m} E\left[\log(\sigma_i^2) \mid y, \theta_{t-1}, \kappa_{t-1}, \tau_{t-1}^2, \upsilon_{t-1}\right] \\
& \quad - \frac{\tau^2\upsilon}{2}\sum_{i=c,0,1,...,m} E\left[\frac{1}{\sigma_i^2} \,\Big|\, y, \theta_{t-1}, \kappa_{t-1}, \tau_{t-1}^2, \upsilon_{t-1}\right] - (m+2)\log\Gamma\!\left(\frac{\upsilon}{2}\right) \\
& = \frac{(m+2)\upsilon}{2}\log\!\left(\frac{\upsilon\tau^2}{2}\right) - \left(\frac{\upsilon}{2}+1\right)\sum_{i=c,0,1,...,m}\left[\log\!\left(\frac{(\upsilon_{t-1}+n_i)\tilde{\sigma}_{i(t-1)}^2}{2}\right) - \psi\!\left(\frac{\upsilon_{t-1}+n_i}{2}\right)\right] \\
& \quad - \frac{\tau^2\upsilon}{2}\sum_{i=c,0,1,...,m}\frac{1}{\tilde{\sigma}_{i(t-1)}^2} - (m+2)\log\Gamma\!\left(\frac{\upsilon}{2}\right)
\end{aligned}
\]
where $\Gamma(\cdot)$ is the gamma function and $\psi(\cdot)$ is the digamma function. By setting
\[
\frac{\partial}{\partial \tau^2}\sum_{i=c,0,1,...,m} E\left[\log p(\sigma_i^2 \mid \tau^2, \upsilon) \mid y, \theta_{t-1}, \kappa_{t-1}, \tau_{t-1}^2, \upsilon_{t-1}\right] = 0
\]
\[
\frac{\partial}{\partial \upsilon}\sum_{i=c,0,1,...,m} E\left[\log p(\sigma_i^2 \mid \tau^2, \upsilon) \mid y, \theta_{t-1}, \kappa_{t-1}, \tau_{t-1}^2, \upsilon_{t-1}\right] = 0
\]
we get
\[
\frac{(m+2)\upsilon}{2\tau^2} - \frac{\upsilon}{2}\sum_{i=c,0,1,...,m}\frac{1}{\tilde{\sigma}_{i(t-1)}^2} = 0
\]
\[
(m+2)\left[\log\!\left(\frac{\upsilon}{2}\right) + \log(\tau^2) - \psi\!\left(\frac{\upsilon}{2}\right)\right] - \sum_{i=c,0,1,...,m}\left[\log\!\left(\frac{(\upsilon_{t-1}+n_i)\tilde{\sigma}_{i(t-1)}^2}{2}\right) - \psi\!\left(\frac{\upsilon_{t-1}+n_i}{2}\right)\right] = 0
\]
We obtained a closed-form solution of the above equations by using the approximation of the digamma function $\psi\!\left(\frac{\upsilon}{2}\right) \approx \log\!\left(\frac{\upsilon}{2}\right) - \frac{1}{\upsilon} - \frac{1}{3\upsilon^2}$, which is
\[
\tau_{opt}^2 = \frac{m+2}{\sum_{i=c,0,1,...,m} 1/\tilde{\sigma}_{i(t-1)}^2}
\]
\[
\upsilon_{opt} = \frac{2}{3\left(\sqrt{1 + \frac{4}{3}T} - 1\right)}
\]
where $T = \frac{1}{m+2}\sum_{i=c,0,1,...,m}\left[\log\!\left(\frac{(\upsilon_{t-1}+n_i)\tilde{\sigma}_{i(t-1)}^2}{2}\right) - \psi\!\left(\frac{\upsilon_{t-1}+n_i}{2}\right)\right] - \log(\tau_{opt}^2)$.
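Correspondingly, a sketch of this M-step in Python, with scipy.special.digamma providing $\psi(\cdot)$; substituting the digamma approximation into the second stationarity condition reduces it to $1/\upsilon + 1/(3\upsilon^2) = T$, i.e. $3T\upsilon^2 - 3\upsilon - 1 = 0$, whose positive root is the $\upsilon_{opt}$ above (the array and function names are our own).

import numpy as np
from scipy.special import digamma

def m_step_tau2_upsilon(sigma2_tilde, n, upsilon_prev):
    # Closed-form update of tau^2 and upsilon; arrays are indexed by
    # i = c, 0, 1, ..., m (m + 2 groups) and hold sigma_tilde^2_{i(t-1)} and n_i.
    k = len(sigma2_tilde)  # m + 2
    tau2_opt = k / np.sum(1.0 / sigma2_tilde)
    T = (np.mean(np.log((upsilon_prev + n) * sigma2_tilde / 2.0)
                 - digamma((upsilon_prev + n) / 2.0))
         - np.log(tau2_opt))
    # Positive root of 3*T*upsilon^2 - 3*upsilon - 1 = 0
    upsilon_opt = 2.0 / (3.0 * (np.sqrt(1.0 + 4.0 * T / 3.0) - 1.0))
    return tau2_opt, upsilon_opt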
Hyperparameter estimation for the null model

For the hierarchical model with control data, we denote the pooled Box-Cox transformed IPDs of the native and control data as $y_p$ (i.e. $(y_{c1}, y_{c2}, ..., y_{cn_c}, y_{01}, y_{02}, ..., y_{0n_0})$). For the hierarchical model without control data, we simply set $y_p$ equal to $y_c$, the Box-Cox transformed IPDs of the native sample, because it is a special case of the hierarchical model with control data in which $y_0$ is empty. We assume $y_p$ follows a normal distribution
\[
y_p \sim N(\mu_p, \sigma_p^2)
\]
and $(\mu_p, \sigma_p^2), (\mu_1, \sigma_1^2), (\mu_2, \sigma_2^2), ..., (\mu_m, \sigma_m^2)$ have the same prior distribution, which is
\[
p(\sigma_i^2 \mid \upsilon, \tau^2) = \text{scaled inverse-}\chi^2(\upsilon, \tau^2)
\]
\[
p(\mu_i \mid \sigma_i^2, \theta, \kappa) = N\!\left(\theta, \frac{\sigma_i^2}{\kappa}\right)
\]
where $i = p, 1, ..., m$. The posterior distributions of $\mu_i$ and $\sigma_i^2$, $i = p, 1, ..., m$, are [1]
\[
p(\mu_i \mid y_i, \sigma_i^2, \theta, \kappa) = N\!\left(\frac{\kappa}{\kappa + n_i}\theta + \frac{n_i}{\kappa + n_i}\bar{y}_i,\; \frac{\sigma_i^2}{\kappa + n_i}\right)
\]
\[
p(\sigma_i^2 \mid y_i, \upsilon, \tau^2) = \text{scaled inverse-}\chi^2(\upsilon + n_i, \tilde{\sigma}_i^2)
\]
where
\[
\tilde{\sigma}_i^2 = \frac{1}{\upsilon + n_i}\left[\upsilon\tau^2 + (n_i - 1)s_i^2 + \frac{\kappa n_i}{\kappa + n_i}(\bar{y}_i - \theta)^2\right]
\]
We used the posterior expectations of $\mu_i$ and $\sigma_i^2$ as their estimates, which are
\[
\hat{\mu}_i = E(\mu_i \mid y_i, \theta, \kappa) = \frac{\kappa}{\kappa + n_i}\theta + \frac{n_i}{\kappa + n_i}\bar{y}_i
\]
\[
\hat{\sigma}_i^2 = E(\sigma_i^2 \mid y_i, \upsilon, \tau^2) = \frac{n_i + \upsilon}{n_i + \upsilon - 2}\tilde{\sigma}_i^2
\]
where $i = p, 1, ..., m$. As in the previous section, we adopt an EM algorithm (Algorithm 1) to maximize the marginal log-likelihood function $L(y_p, y_1, ..., y_m; \theta, \kappa, \upsilon, \tau^2)$. In the EM procedure, $\mu_i$ and $\sigma_i^2$ were regarded as missing data, and the log-likelihood function of the complete data was
\[
l(y, \mu, \sigma^2; \theta, \kappa, \tau^2, \upsilon) = \log p(y \mid \mu, \sigma^2) + \sum_{i=p,1,...,m}\log p(\mu_i \mid \sigma_i^2, \theta, \kappa) + \sum_{i=p,1,...,m}\log p(\sigma_i^2 \mid \tau^2, \upsilon) \tag{4}
\]
where $y = (y_p, y_1, ..., y_m)$, $\mu = (\mu_p, \mu_1, ..., \mu_m)$ and $\sigma^2 = (\sigma_p^2, \sigma_1^2, ..., \sigma_m^2)$.
Estimating $\theta$ and $\kappa$

By taking the posterior expectation of the second term of equation (4), we get
\[
\begin{aligned}
& \sum_{i=p,1,...,m} E\left[\log p(\mu_i \mid \sigma_i^2, \theta, \kappa) \mid y, \theta_{t-1}, \kappa_{t-1}, \tau_{t-1}^2, \upsilon_{t-1}\right] \\
& = -\frac{\kappa}{2}\sum_{i=p,1,...,m}\left[\frac{(\theta - \hat{\mu}_{i(t-1)})^2}{\tilde{\sigma}_{i(t-1)}^2} + \frac{1}{\kappa_{t-1} + n_i}\right] + \frac{m+1}{2}\log(\kappa) + C
\end{aligned}
\]
Thus, $\sum_{i=p,1,...,m} E\left[\log p(\mu_i \mid \sigma_i^2, \theta, \kappa) \mid y, \theta_{t-1}, \kappa_{t-1}, \tau_{t-1}^2, \upsilon_{t-1}\right]$ can be maximized by
\[
\theta_{opt} = \sum_{i=p,1,...,m}\frac{\hat{\mu}_{i(t-1)}}{\tilde{\sigma}_{i(t-1)}^2} \Big/ \sum_{i=p,1,...,m}\frac{1}{\tilde{\sigma}_{i(t-1)}^2}
\]
\[
\kappa_{opt} = (m+1) \Big/ \sum_{i=p,1,...,m}\left[\frac{(\theta_{opt} - \hat{\mu}_{i(t-1)})^2}{\tilde{\sigma}_{i(t-1)}^2} + \frac{1}{\kappa_{t-1} + n_i}\right]
\]

Estimating $\tau^2$ and $\upsilon$
By taking the posterior expectation of the third term of equation (4), we get
\[
\begin{aligned}
& \sum_{i=p,1,...,m} E\left[\log p(\sigma_i^2 \mid \tau^2, \upsilon) \mid y, \theta_{t-1}, \kappa_{t-1}, \tau_{t-1}^2, \upsilon_{t-1}\right] \\
& = \frac{(m+1)\upsilon}{2}\log\!\left(\frac{\upsilon\tau^2}{2}\right) - \left(\frac{\upsilon}{2}+1\right)\sum_{i=p,1,...,m}\left[\log\!\left(\frac{(\upsilon_{t-1}+n_i)\tilde{\sigma}_{i(t-1)}^2}{2}\right) - \psi\!\left(\frac{\upsilon_{t-1}+n_i}{2}\right)\right] \\
& \quad - \frac{\tau^2\upsilon}{2}\sum_{i=p,1,...,m}\frac{1}{\tilde{\sigma}_{i(t-1)}^2} - (m+1)\log\Gamma\!\left(\frac{\upsilon}{2}\right)
\end{aligned}
\tag{5}
\]
By using the approximation of the digamma function $\psi\!\left(\frac{\upsilon}{2}\right) \approx \log\!\left(\frac{\upsilon}{2}\right) - \frac{1}{\upsilon} - \frac{1}{3\upsilon^2}$, (5) can be maximized by
\[
\tau_{opt}^2 = \frac{m+1}{\sum_{i=p,1,...,m} 1/\tilde{\sigma}_{i(t-1)}^2}
\]
\[
\upsilon_{opt} = \frac{2}{3\left(\sqrt{1 + \frac{4}{3}T} - 1\right)}
\]
where $T = \frac{1}{m+1}\sum_{i=p,1,...,m}\left[\log\!\left(\frac{(\upsilon_{t-1}+n_i)\tilde{\sigma}_{i(t-1)}^2}{2}\right) - \psi\!\left(\frac{\upsilon_{t-1}+n_i}{2}\right)\right] - \log(\tau_{opt}^2)$.
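Because the null model only changes which groups enter the sums ($i = p, 1, ..., m$), the routines sketched above can be reused after pooling the data. Below is a hypothetical Python driver, assuming posterior_estimates, m_step_theta_kappa and m_step_tau2_upsilon from the earlier sketches are in scope; it is simplified in that all groups share the same prior on $\mu_i$, as in the null model, and the stopping rule based on $\Delta l$ is only indicated in a comment.

import numpy as np

def em_hyperparameters(groups, theta=0.0, kappa=1.0, upsilon=4.0, tau2=1.0,
                       n_iter=50):
    # groups: list of 1-D arrays, e.g. [y_p, y_1, ..., y_m] for the null model,
    # where y_p pools the control and native IPDs at the tested position.
    n = np.array([len(g) for g in groups], dtype=float)
    for _ in range(n_iter):
        # E-step quantities: posterior estimates under the previous hyperparameters
        est = [posterior_estimates(g, theta, kappa, upsilon, tau2) for g in groups]
        mu_hat = np.array([e[0] for e in est])
        sigma2_tilde = np.array([e[2] for e in est])
        # M-step: closed-form updates derived above
        theta, kappa = m_step_theta_kappa(mu_hat, sigma2_tilde, n, kappa)
        tau2, upsilon = m_step_tau2_upsilon(sigma2_tilde, n, upsilon)
        # In Algorithm 1 the loop instead stops once the increase Delta-l in the
        # expected complete-data log-likelihood falls to 0.1 or below.
    return theta, kappa, upsilon, tau2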
References

[1] Gelman A, Carlin JB, Stern HS, Rubin DB (2003) Bayesian Data Analysis, 2nd edition. Chapman and Hall/CRC.