Detecting DNA modifications from SMRT sequencing data by modeling sequence context dependence of polymerase kinetic

Zhixing Feng 1,4, Gang Fang 2, Jonas Korlach 3, Tyson Clark 3, Khai Luong 3, Xuegong Zhang 1, Wing Wong 4,*, and Eric Schadt 3,5,*

1 Tsinghua National Laboratory for Information Science and Technology, and Department of Automation, Tsinghua University, Beijing 100084, China; 2 Department of Computer Science and Engineering, University of Minnesota, Minneapolis, MN 55455; 3 Pacific Biosciences, 1380 Willow Road, Menlo Park, CA 94025; 4 Department of Statistics, Stanford University, Stanford, CA 94305; 5 Department of Genetics and Genomics Sciences, Mount Sinai School of Medicine, New York, NY 10029

* Correspondence: [email protected], [email protected]
Text S1: Hyperparameter estimation for the alternative model

In this section, we explain how to estimate the hyperparameters assuming a control sample is available. Since the hierarchical model without control data is essentially a special case of the hierarchical model with control data, one can simply remove $y_0$, $\mu_0$ and $\sigma_0^2$ to obtain the hyperparameter estimates when a control sample is not available; the algorithm is otherwise unchanged. When the alternative model is true (see sections Hierarchical model with control data and Hierarchical model without control data in the main text), the posterior distributions of $\mu_i$ and $\sigma_i^2$, $i = 0, 1, ..., m$, are [1]
\[
p(\mu_i \mid y_i, \sigma_i^2, \theta, \kappa) = N\!\left(\frac{\kappa}{\kappa + n_i}\theta + \frac{n_i}{\kappa + n_i}\bar{y}_i,\; \frac{\sigma_i^2}{\kappa + n_i}\right)
\]
\[
p(\sigma_i^2 \mid y_i, \upsilon, \tau^2) = \text{scaled inverse-}\chi^2(\upsilon + n_i, \tilde{\sigma}_i^2)
\]
where
\[
\tilde{\sigma}_i^2 = \frac{1}{\upsilon + n_i}\left[\upsilon\tau^2 + (n_i - 1)s_i^2 + \frac{\kappa n_i}{\kappa + n_i}(\bar{y}_i - \theta)^2\right] \tag{1}
\]
Here $\bar{y}_i = \frac{1}{n_i}\sum_{j=1}^{n_i} y_{ij}$ and $s_i^2 = \frac{1}{n_i - 1}\sum_{j=1}^{n_i}(y_{ij} - \bar{y}_i)^2$. The posterior distribution of $\sigma_c^2$ is [1]
\[
p(\sigma_c^2 \mid y_c, \upsilon, \tau^2) = \text{scaled inverse-}\chi^2(\upsilon + n_c, \tilde{\sigma}_c^2)
\]
where
\[
\tilde{\sigma}_c^2 = \frac{1}{\upsilon + n_c}\left[\upsilon\tau^2 + (n_c - 1)s_c^2\right]
\]
We used the posterior expectations of $\mu_i$ and $\sigma_i^2$ as their estimates, which are
\[
\hat{\mu}_i = E(\mu_i \mid y_i, \theta, \kappa) = \frac{\kappa}{\kappa + n_i}\theta + \frac{n_i}{\kappa + n_i}\bar{y}_i, \quad i = 0, 1, ..., m, \tag{2}
\]
\[
\hat{\sigma}_i^2 = E(\sigma_i^2 \mid y_i, \upsilon, \tau^2) = \frac{n_i + \upsilon}{n_i + \upsilon - 2}\tilde{\sigma}_i^2, \quad i = c, 0, 1, ..., m.
\]
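As a concrete illustration, the following Python sketch evaluates equations (1) and (2) for a single group of observations; the function name and the NumPy-based interface are our own choices and are not part of the original implementation.

import numpy as np

def posterior_estimates(y_i, theta, kappa, upsilon, tau2):
    # Posterior point estimates for one group y_i = (y_i1, ..., y_i n_i)
    # of Box-Cox transformed IPDs, given the current hyperparameters.
    n_i = len(y_i)
    ybar_i = np.mean(y_i)
    s2_i = np.var(y_i, ddof=1) if n_i > 1 else 0.0  # sample variance s_i^2
    # Scale of the scaled inverse-chi^2 posterior, equation (1)
    sigma2_tilde = (upsilon * tau2 + (n_i - 1) * s2_i
                    + kappa * n_i / (kappa + n_i) * (ybar_i - theta) ** 2) / (upsilon + n_i)
    # Posterior expectations, equation (2)
    mu_hat = kappa / (kappa + n_i) * theta + n_i / (kappa + n_i) * ybar_i
    sigma2_hat = (n_i + upsilon) / (n_i + upsilon - 2.0) * sigma2_tilde
    return mu_hat, sigma2_hat, sigma2_tilde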
We estimate the hyperparameters $(\theta, \kappa, \upsilon, \tau^2, \mu_c)$ from the data by maximizing the marginal log-likelihood function, which is
\[
\begin{aligned}
L(y_c, y_0, y_1, ..., y_m; \theta, \kappa, \upsilon, \tau^2, \mu_c) &= \log p(y_c, y_0, y_1, ..., y_m \mid \theta, \kappa, \upsilon, \tau^2, \mu_c) \\
&= \log p(y_c \mid \theta, \kappa, \upsilon, \tau^2, \mu_c) + \sum_{i=0}^{m} \log p(y_i \mid \theta, \kappa, \upsilon, \tau^2)
\end{aligned}
\]
where
\[
\begin{aligned}
\log p(y_c \mid \theta, \kappa, \upsilon, \tau^2, \mu_c) = {}& \frac{\upsilon}{2}\log(\upsilon\tau^2) + \log\Gamma\!\left(\frac{\upsilon + n_c}{2}\right) - \frac{\upsilon + n_c}{2}\log\!\left(\sum_{j=1}^{n_c}(y_{cj} - \mu_c)^2 + \upsilon\tau^2\right) \\
& - \log\Gamma\!\left(\frac{\upsilon}{2}\right) - \frac{n_c}{2}\log(\pi)
\end{aligned}
\]
\[
\begin{aligned}
\sum_{i=0}^{m}\log p(y_i \mid \theta, \kappa, \upsilon, \tau^2) = \sum_{i=0}^{m}\Big[& \frac{1}{2}\log(\kappa) + \log\Gamma\!\left(\frac{\upsilon + n_i}{2}\right) + \frac{\upsilon}{2}\log(\upsilon\tau^2) - \frac{1}{2}\log(\kappa + n_i) \\
& - \log\Gamma\!\left(\frac{\upsilon}{2}\right) - \frac{\upsilon + n_i}{2}\log\!\left((\upsilon + n_i)\tilde{\sigma}_i^2\right) - \frac{n_i}{2}\log(\pi)\Big]
\end{aligned}
\]
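For reference, a minimal Python sketch of these two marginal log-likelihood terms, using scipy.special.gammaln for $\log\Gamma(\cdot)$; sigma2_tilde denotes $\tilde{\sigma}_i^2$ from equation (1), and the function names and interfaces are our own.

import numpy as np
from scipy.special import gammaln

def log_marginal_control(y_c, mu_c, upsilon, tau2):
    # log p(y_c | theta, kappa, upsilon, tau2, mu_c) for the control data
    n_c = len(y_c)
    ss = np.sum((np.asarray(y_c) - mu_c) ** 2)
    return (upsilon / 2.0 * np.log(upsilon * tau2)
            + gammaln((upsilon + n_c) / 2.0)
            - (upsilon + n_c) / 2.0 * np.log(ss + upsilon * tau2)
            - gammaln(upsilon / 2.0)
            - n_c / 2.0 * np.log(np.pi))

def log_marginal_native(n_i, kappa, upsilon, tau2, sigma2_tilde):
    # One term of the sum over i = 0, ..., m: log p(y_i | theta, kappa, upsilon, tau2),
    # expressed through n_i and sigma2_tilde from equation (1)
    return (0.5 * np.log(kappa) + gammaln((upsilon + n_i) / 2.0)
            + upsilon / 2.0 * np.log(upsilon * tau2)
            - 0.5 * np.log(kappa + n_i) - gammaln(upsilon / 2.0)
            - (upsilon + n_i) / 2.0 * np.log((upsilon + n_i) * sigma2_tilde)
            - n_i / 2.0 * np.log(np.pi))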
It is obvious that $L(y_c, y_0, y_1, ..., y_m; \theta, \kappa, \upsilon, \tau^2, \mu_c)$ is maximized by setting $\mu_c = \frac{1}{n_c}\sum_{j=1}^{n_c} y_{cj}$. However, it is difficult to obtain a closed form for $(\theta, \kappa, \upsilon, \tau^2)$, and we therefore adopted an EM algorithm (Algorithm 1) to estimate them numerically. In the EM procedure, $\mu_i$ and $\sigma_i^2$ were regarded as missing data, and the log-likelihood function of the complete data was
\[
\begin{aligned}
l(y, \mu, \sigma^2; \theta, \kappa, \tau^2, \upsilon) &= \log p(y \mid \mu, \sigma^2) + \log p(\mu \mid \sigma^2, \theta, \kappa) + \log p(\sigma^2 \mid \upsilon, \tau^2) \\
&= \log p(y \mid \mu, \sigma^2) + \sum_{i=0}^{m}\log p(\mu_i \mid \sigma_i^2, \theta, \kappa) + \sum_{i=c,0,1,...,m}\log p(\sigma_i^2 \mid \tau^2, \upsilon)
\end{aligned}
\tag{3}
\]
where $y = (y_c, y_0, y_1, ..., y_m)$, $\mu = (\mu_0, \mu_1, ..., \mu_m)$ and $\sigma^2 = (\sigma_c^2, \sigma_0^2, \sigma_1^2, ..., \sigma_m^2)$.
Initial values $(\theta_0, \kappa_0, \tau_0^2, \upsilon_0)$ were assigned to $(\theta, \kappa, \tau^2, \upsilon)$, and in the $t$-th step ($t \geq 1$) of the EM algorithm, $(\theta, \kappa, \tau^2, \upsilon)$ were updated with the optimal values $(\theta_{opt}, \kappa_{opt}, \tau_{opt}^2, \upsilon_{opt})$ maximizing $E\left[l(y, \mu, \sigma^2; \theta, \kappa, \tau^2, \upsilon) \mid y, \theta_{t-1}, \kappa_{t-1}, \tau_{t-1}^2, \upsilon_{t-1}\right]$, where $E(\cdot)$ is the expectation with respect to $(\mu, \sigma^2)$; that is, $\theta_t = \theta_{opt}$, $\kappa_t = \kappa_{opt}$, $\tau_t^2 = \tau_{opt}^2$, $\upsilon_t = \upsilon_{opt}$. This procedure was repeated until convergence.

In the $t$-th step of the EM algorithm, $(\theta_{opt}, \kappa_{opt})$ and $(\tau_{opt}^2, \upsilon_{opt})$ can be obtained by maximizing the posterior expectations of the second and third terms of equation (3), i.e. $\sum_{i=0}^{m} E\left[\log p(\mu_i \mid \sigma_i^2, \theta, \kappa) \mid y, \theta_{t-1}, \kappa_{t-1}, \tau_{t-1}^2, \upsilon_{t-1}\right]$ and $\sum_{i=c,0,1,...,m} E\left[\log p(\sigma_i^2 \mid \tau^2, \upsilon) \mid y, \theta_{t-1}, \kappa_{t-1}, \tau_{t-1}^2, \upsilon_{t-1}\right]$, respectively. Convergence is monitored through the improvement in the expected complete-data log-likelihood,
\[
\Delta l = E\left[l(y, \mu, \sigma^2; \theta_{opt}, \kappa_{opt}, \tau_{opt}^2, \upsilon_{opt}) \mid y, \theta_{t-1}, \kappa_{t-1}, \tau_{t-1}^2, \upsilon_{t-1}\right] - E\left[l(y, \mu, \sigma^2; \theta_{t-1}, \kappa_{t-1}, \tau_{t-1}^2, \upsilon_{t-1}) \mid y, \theta_{t-1}, \kappa_{t-1}, \tau_{t-1}^2, \upsilon_{t-1}\right]
\]
Algorithm 1: EM algorithm for hyperparameter estimation

Assign initial values to the hyperparameters, $\theta = \theta_0$, $\kappa = \kappa_0$, $\upsilon = \upsilon_0$ and $\tau^2 = \tau_0^2$. Set $t = 1$.
while $\Delta l > 0.1$ do
  E-step: Calculate the conditional expectation of the log-likelihood function with respect to $\mu_i$ and $\sigma_i^2$, i.e. $E\left[l(y, \mu, \sigma^2; \theta, \kappa, \tau^2, \upsilon) \mid y, \theta_{t-1}, \kappa_{t-1}, \tau_{t-1}^2, \upsilon_{t-1}\right]$.
  M-step: Update $(\theta, \kappa, \upsilon, \tau^2)$ by setting $\theta_t = \theta_{opt}$, $\kappa_t = \kappa_{opt}$, $\upsilon_t = \upsilon_{opt}$, $\tau_t^2 = \tau_{opt}^2$, where $(\theta_{opt}, \kappa_{opt}, \upsilon_{opt}, \tau_{opt}^2)$ are the hyperparameters maximizing $E\left[l(y, \mu, \sigma^2; \theta, \kappa, \tau^2, \upsilon) \mid y, \theta_{t-1}, \kappa_{t-1}, \tau_{t-1}^2, \upsilon_{t-1}\right]$.
  Set $t = t + 1$.
end while

Estimating $\theta$ and $\kappa$

By taking the posterior expectation of the second term of equation (3), we get
\[
\begin{aligned}
& \sum_{i=0}^{m} E\left[\log p(\mu_i \mid \sigma_i^2, \theta, \kappa) \mid y, \theta_{t-1}, \kappa_{t-1}, \tau_{t-1}^2, \upsilon_{t-1}\right] \\
& = -\frac{\kappa}{2}\sum_{i=0}^{m}\left[\frac{(\theta - \hat{\mu}_{i(t-1)})^2}{\tilde{\sigma}_{i(t-1)}^2} + \frac{1}{\kappa_{t-1} + n_i}\right] + \frac{m+1}{2}\log(\kappa) + C
\end{aligned}
\]
where $\tilde{\sigma}_{i(t-1)}^2$ and $\hat{\mu}_{i(t-1)}$ are the estimates of $\tilde{\sigma}_i^2$ and $\mu_i$ given $(\theta_{t-1}, \kappa_{t-1}, \tau_{t-1}^2, \upsilon_{t-1})$ (equations (1) and (2)), and $C$ is a constant that does not involve any hyperparameters.
$\sum_{i=0}^{m} E\left[\log p(\mu_i \mid \sigma_i^2, \theta, \kappa) \mid y, \theta_{t-1}, \kappa_{t-1}, \tau_{t-1}^2, \upsilon_{t-1}\right]$ can be maximized by $\theta_{opt}$ and $\kappa_{opt}$, which are
\[
\theta_{opt} = \sum_{i=0}^{m}\frac{\hat{\mu}_{i(t-1)}}{\tilde{\sigma}_{i(t-1)}^2} \Big/ \sum_{i=0}^{m}\frac{1}{\tilde{\sigma}_{i(t-1)}^2}
\]
\[
\kappa_{opt} = (m+1) \Big/ \sum_{i=0}^{m}\left[\frac{(\theta_{opt} - \hat{\mu}_{i(t-1)})^2}{\tilde{\sigma}_{i(t-1)}^2} + \frac{1}{\kappa_{t-1} + n_i}\right]
\]
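A hedged Python sketch of this M-step update, with mu_hat, sigma2_tilde and n holding the group-wise quantities $\hat{\mu}_{i(t-1)}$, $\tilde{\sigma}_{i(t-1)}^2$ and $n_i$ for $i = 0, ..., m$, and kappa_prev holding $\kappa_{t-1}$ (the names are ours):

import numpy as np

def m_step_theta_kappa(mu_hat, sigma2_tilde, n, kappa_prev):
    # Closed-form update of theta and kappa; arrays are indexed by i = 0, ..., m
    theta_opt = np.sum(mu_hat / sigma2_tilde) / np.sum(1.0 / sigma2_tilde)
    denom = np.sum((theta_opt - mu_hat) ** 2 / sigma2_tilde + 1.0 / (kappa_prev + n))
    kappa_opt = len(mu_hat) / denom  # len(mu_hat) = m + 1 groups
    return theta_opt, kappa_opt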
Estimating $\tau^2$ and $\upsilon$
By taking the posterior expectation of the third term of equation (3), we get
\[
\begin{aligned}
& \sum_{i=c,0,1,...,m} E\left[\log p(\sigma_i^2 \mid \tau^2, \upsilon) \mid y, \theta_{t-1}, \kappa_{t-1}, \tau_{t-1}^2, \upsilon_{t-1}\right] \\
& = \frac{(m+2)\upsilon}{2}\log\!\left(\frac{\upsilon\tau^2}{2}\right) - \left(\frac{\upsilon}{2}+1\right)\sum_{i=c,0,1,...,m} E\left[\log(\sigma_i^2) \mid y, \theta_{t-1}, \kappa_{t-1}, \tau_{t-1}^2, \upsilon_{t-1}\right] \\
& \quad - \frac{\tau^2\upsilon}{2}\sum_{i=c,0,1,...,m} E\left[\frac{1}{\sigma_i^2} \,\Big|\, y, \theta_{t-1}, \kappa_{t-1}, \tau_{t-1}^2, \upsilon_{t-1}\right] - (m+2)\log\Gamma\!\left(\frac{\upsilon}{2}\right) \\
& = \frac{(m+2)\upsilon}{2}\log\!\left(\frac{\upsilon\tau^2}{2}\right) - \left(\frac{\upsilon}{2}+1\right)\sum_{i=c,0,1,...,m}\left[\log\!\left(\frac{(\upsilon_{t-1}+n_i)\tilde{\sigma}_{i(t-1)}^2}{2}\right) - \psi\!\left(\frac{\upsilon_{t-1}+n_i}{2}\right)\right] \\
& \quad - \frac{\tau^2\upsilon}{2}\sum_{i=c,0,1,...,m}\frac{1}{\tilde{\sigma}_{i(t-1)}^2} - (m+2)\log\Gamma\!\left(\frac{\upsilon}{2}\right)
\end{aligned}
\]
where $\Gamma(\cdot)$ is the gamma function and $\psi(\cdot)$ is the digamma function. By setting
\[
\frac{\partial}{\partial \tau^2}\sum_{i=c,0,1,...,m} E\left[\log p(\sigma_i^2 \mid \tau^2, \upsilon) \mid y, \theta_{t-1}, \kappa_{t-1}, \tau_{t-1}^2, \upsilon_{t-1}\right] = 0
\]
\[
\frac{\partial}{\partial \upsilon}\sum_{i=c,0,1,...,m} E\left[\log p(\sigma_i^2 \mid \tau^2, \upsilon) \mid y, \theta_{t-1}, \kappa_{t-1}, \tau_{t-1}^2, \upsilon_{t-1}\right] = 0
\]
we get
\[
\frac{(m+2)\upsilon}{2\tau^2} - \frac{\upsilon}{2}\sum_{i=c,0,1,...,m}\frac{1}{\tilde{\sigma}_{i(t-1)}^2} = 0
\]
\[
(m+2)\left[\log\!\left(\frac{\upsilon}{2}\right) + \log(\tau^2) - \psi\!\left(\frac{\upsilon}{2}\right)\right] - \sum_{i=c,0,1,...,m}\left[\log\!\left(\frac{(\upsilon_{t-1}+n_i)\tilde{\sigma}_{i(t-1)}^2}{2}\right) - \psi\!\left(\frac{\upsilon_{t-1}+n_i}{2}\right)\right] = 0
\]
We obtained a closed-form solution of the above equations by using the approximation of the digamma function $\psi\!\left(\frac{\upsilon}{2}\right) \approx \log\!\left(\frac{\upsilon}{2}\right) - \frac{1}{\upsilon} - \frac{1}{3\upsilon^2}$, which is
\[
\tau_{opt}^2 = \frac{m+2}{\sum_{i=c,0,1,...,m} 1/\tilde{\sigma}_{i(t-1)}^2}
\]
\[
\upsilon_{opt} = \frac{2}{3\left(\sqrt{1 + \frac{4}{3}T} - 1\right)}
\]
where $T = \frac{1}{m+2}\sum_{i=c,0,1,...,m}\left[\log\!\left(\frac{(\upsilon_{t-1}+n_i)\tilde{\sigma}_{i(t-1)}^2}{2}\right) - \psi\!\left(\frac{\upsilon_{t-1}+n_i}{2}\right)\right] - \log(\tau_{opt}^2)$.
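Correspondingly, a sketch of this M-step in Python, with scipy.special.digamma providing $\psi(\cdot)$; substituting the digamma approximation into the second stationarity condition reduces it to $1/\upsilon + 1/(3\upsilon^2) = T$, i.e. $3T\upsilon^2 - 3\upsilon - 1 = 0$, whose positive root is the $\upsilon_{opt}$ above (the array and function names are our own).

import numpy as np
from scipy.special import digamma

def m_step_tau2_upsilon(sigma2_tilde, n, upsilon_prev):
    # Closed-form update of tau^2 and upsilon; arrays are indexed by
    # i = c, 0, 1, ..., m (m + 2 groups) and hold sigma_tilde^2_{i(t-1)} and n_i.
    k = len(sigma2_tilde)  # m + 2
    tau2_opt = k / np.sum(1.0 / sigma2_tilde)
    T = (np.mean(np.log((upsilon_prev + n) * sigma2_tilde / 2.0)
                 - digamma((upsilon_prev + n) / 2.0))
         - np.log(tau2_opt))
    # Positive root of 3*T*upsilon^2 - 3*upsilon - 1 = 0
    upsilon_opt = 2.0 / (3.0 * (np.sqrt(1.0 + 4.0 * T / 3.0) - 1.0))
    return tau2_opt, upsilon_opt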
Hyperparameter estimation for the null model

For the hierarchical model with control data, we denote the pooled Box-Cox transformed IPDs of the native and control data as $y_p$ (i.e. $(y_{c1}, y_{c2}, ..., y_{cn_c}, y_{01}, y_{02}, ..., y_{0n_0})$). For the hierarchical model without control data, we simply set $y_p$ equal to $y_c$, the Box-Cox transformed IPDs of the native sample, because it is a special case of the hierarchical model with control data in which $y_0$ is empty. We assume $y_p$ follows a normal distribution
\[
y_p \sim N(\mu_p, \sigma_p^2)
\]
and $(\mu_p, \sigma_p^2), (\mu_1, \sigma_1^2), (\mu_2, \sigma_2^2), ..., (\mu_m, \sigma_m^2)$ have the same prior distribution, which is
\[
p(\sigma_i^2 \mid \upsilon, \tau^2) = \text{scaled inverse-}\chi^2(\upsilon, \tau^2)
\]
\[
p(\mu_i \mid \sigma_i^2, \theta, \kappa) = N\!\left(\theta, \frac{\sigma_i^2}{\kappa}\right)
\]
where $i = p, 1, ..., m$. The posterior distributions of $\mu_i$ and $\sigma_i^2$, $i = p, 1, ..., m$, are [1]
\[
p(\mu_i \mid y_i, \sigma_i^2, \theta, \kappa) = N\!\left(\frac{\kappa}{\kappa + n_i}\theta + \frac{n_i}{\kappa + n_i}\bar{y}_i,\; \frac{\sigma_i^2}{\kappa + n_i}\right)
\]
\[
p(\sigma_i^2 \mid y_i, \upsilon, \tau^2) = \text{scaled inverse-}\chi^2(\upsilon + n_i, \tilde{\sigma}_i^2)
\]
where
\[
\tilde{\sigma}_i^2 = \frac{1}{\upsilon + n_i}\left[\upsilon\tau^2 + (n_i - 1)s_i^2 + \frac{\kappa n_i}{\kappa + n_i}(\bar{y}_i - \theta)^2\right]
\]
We used the posterior expectations of $\mu_i$ and $\sigma_i^2$ as their estimates, which are
\[
\hat{\mu}_i = E(\mu_i \mid y_i, \theta, \kappa) = \frac{\kappa}{\kappa + n_i}\theta + \frac{n_i}{\kappa + n_i}\bar{y}_i
\]
\[
\hat{\sigma}_i^2 = E(\sigma_i^2 \mid y_i, \upsilon, \tau^2) = \frac{n_i + \upsilon}{n_i + \upsilon - 2}\tilde{\sigma}_i^2
\]
where $i = p, 1, ..., m$. As in the previous section, we adopt an EM algorithm (Algorithm 1) to maximize the marginal log-likelihood function $L(y_p, y_1, ..., y_m; \theta, \kappa, \upsilon, \tau^2)$. In the EM procedure, $\mu_i$ and $\sigma_i^2$ were regarded as missing data, and the log-likelihood function of the complete data was
\[
l(y, \mu, \sigma^2; \theta, \kappa, \tau^2, \upsilon) = \log p(y \mid \mu, \sigma^2) + \sum_{i=p,1,...,m}\log p(\mu_i \mid \sigma_i^2, \theta, \kappa) + \sum_{i=p,1,...,m}\log p(\sigma_i^2 \mid \tau^2, \upsilon) \tag{4}
\]
where $y = (y_p, y_1, ..., y_m)$, $\mu = (\mu_p, \mu_1, ..., \mu_m)$ and $\sigma^2 = (\sigma_p^2, \sigma_1^2, ..., \sigma_m^2)$.
Estimating $\theta$ and $\kappa$

By taking the posterior expectation of the second term of equation (4), we get
\[
\begin{aligned}
& \sum_{i=p,1,...,m} E\left[\log p(\mu_i \mid \sigma_i^2, \theta, \kappa) \mid y, \theta_{t-1}, \kappa_{t-1}, \tau_{t-1}^2, \upsilon_{t-1}\right] \\
& = -\frac{\kappa}{2}\sum_{i=p,1,...,m}\left[\frac{(\theta - \hat{\mu}_{i(t-1)})^2}{\tilde{\sigma}_{i(t-1)}^2} + \frac{1}{\kappa_{t-1} + n_i}\right] + \frac{m+1}{2}\log(\kappa) + C
\end{aligned}
\]
Thus, $\sum_{i=p,1,...,m} E\left[\log p(\mu_i \mid \sigma_i^2, \theta, \kappa) \mid y, \theta_{t-1}, \kappa_{t-1}, \tau_{t-1}^2, \upsilon_{t-1}\right]$ can be maximized by
\[
\theta_{opt} = \sum_{i=p,1,...,m}\frac{\hat{\mu}_{i(t-1)}}{\tilde{\sigma}_{i(t-1)}^2} \Big/ \sum_{i=p,1,...,m}\frac{1}{\tilde{\sigma}_{i(t-1)}^2}
\]
\[
\kappa_{opt} = (m+1) \Big/ \sum_{i=p,1,...,m}\left[\frac{(\theta_{opt} - \hat{\mu}_{i(t-1)})^2}{\tilde{\sigma}_{i(t-1)}^2} + \frac{1}{\kappa_{t-1} + n_i}\right]
\]

Estimating $\tau^2$ and $\upsilon$
By taking the posterior expectation of the third term of equation (4), we get
\[
\begin{aligned}
& \sum_{i=p,1,...,m} E\left[\log p(\sigma_i^2 \mid \tau^2, \upsilon) \mid y, \theta_{t-1}, \kappa_{t-1}, \tau_{t-1}^2, \upsilon_{t-1}\right] \\
& = \frac{(m+1)\upsilon}{2}\log\!\left(\frac{\upsilon\tau^2}{2}\right) - \left(\frac{\upsilon}{2}+1\right)\sum_{i=p,1,...,m}\left[\log\!\left(\frac{(\upsilon_{t-1}+n_i)\tilde{\sigma}_{i(t-1)}^2}{2}\right) - \psi\!\left(\frac{\upsilon_{t-1}+n_i}{2}\right)\right] \\
& \quad - \frac{\tau^2\upsilon}{2}\sum_{i=p,1,...,m}\frac{1}{\tilde{\sigma}_{i(t-1)}^2} - (m+1)\log\Gamma\!\left(\frac{\upsilon}{2}\right)
\end{aligned}
\tag{5}
\]
By using the approximation of the digamma function $\psi\!\left(\frac{\upsilon}{2}\right) \approx \log\!\left(\frac{\upsilon}{2}\right) - \frac{1}{\upsilon} - \frac{1}{3\upsilon^2}$, (5) can be maximized by
\[
\tau_{opt}^2 = \frac{m+1}{\sum_{i=p,1,...,m} 1/\tilde{\sigma}_{i(t-1)}^2}
\]
\[
\upsilon_{opt} = \frac{2}{3\left(\sqrt{1 + \frac{4}{3}T} - 1\right)}
\]
where $T = \frac{1}{m+1}\sum_{i=p,1,...,m}\left[\log\!\left(\frac{(\upsilon_{t-1}+n_i)\tilde{\sigma}_{i(t-1)}^2}{2}\right) - \psi\!\left(\frac{\upsilon_{t-1}+n_i}{2}\right)\right] - \log(\tau_{opt}^2)$.
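Because the null model only changes which groups enter the sums ($i = p, 1, ..., m$), the routines sketched above can be reused after pooling the data. Below is a hypothetical Python driver, assuming posterior_estimates, m_step_theta_kappa and m_step_tau2_upsilon from the earlier sketches are in scope; it is simplified in that all groups share the same prior on $\mu_i$, as in the null model, and the stopping rule based on $\Delta l$ is only indicated in a comment.

import numpy as np

def em_hyperparameters(groups, theta=0.0, kappa=1.0, upsilon=4.0, tau2=1.0,
                       n_iter=50):
    # groups: list of 1-D arrays, e.g. [y_p, y_1, ..., y_m] for the null model,
    # where y_p pools the control and native IPDs at the tested position.
    n = np.array([len(g) for g in groups], dtype=float)
    for _ in range(n_iter):
        # E-step quantities: posterior estimates under the previous hyperparameters
        est = [posterior_estimates(g, theta, kappa, upsilon, tau2) for g in groups]
        mu_hat = np.array([e[0] for e in est])
        sigma2_tilde = np.array([e[2] for e in est])
        # M-step: closed-form updates derived above
        theta, kappa = m_step_theta_kappa(mu_hat, sigma2_tilde, n, kappa)
        tau2, upsilon = m_step_tau2_upsilon(sigma2_tilde, n, upsilon)
        # In Algorithm 1 the loop instead stops once the increase Delta-l in the
        # expected complete-data log-likelihood falls to 0.1 or below.
    return theta, kappa, upsilon, tau2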
References

[1] Gelman A, Carlin JB, Stern HS, Rubin DB (2003) Bayesian Data Analysis, 2nd edition. Chapman and Hall/CRC.