PSSV: A novel pattern-based probabilistic approach ...

1 downloads 0 Views 2MB Size Report
Sep 21, 2016 - py and predict long-term prognosis, it is important to identify somatic ... We developed a pattern-based probabilistic approach (PSSV) for so-.
Bioinformatics Advance Access published September 21, 2016

Genome Analysis

PSSV: A novel pattern-based probabilistic approach for somatic structural variation identification Xi Chen1, Xu Shi1, Leena Hilakivi-Clarke2, Ayesha N. Shajahan-Haq2, Robert Clarke2, Jianhua Xuan1,* 1

*To whom correspondence should be addressed. Associate Editor: Prof. Alfonso Valencia

Abstract Motivation: Whole genome DNA-sequencing (WGS) of paired tumor and normal samples has enabled the identification of somatic DNA changes in an unprecedented detail. Large-scale identification of somatic structural variations (SVs) for a specific cancer type will deepen our understanding of driver mechanisms in cancer progression. However, the limited number of WGS samples, insufficient read coverage, and the impurity of tumor samples that contain normal and neoplastic cells, limit reliable and accurate detection of somatic SVs. Results: We present a novel pattern-based probabilistic approach, PSSV, to identify somatic structural variations from WGS data. PSSV features a mixture model with hidden states representing different mutation patterns; PSSV can thus differentiate heterozygous and homozygous SVs in each sample, enabling the identification of those somatic SVs with heterozygous mutations in normal samples and homozygous mutations in tumor samples. Simulation studies demonstrate that PSSV outperforms existing tools. PSSV has been successfully applied to breast cancer data to identify somatic SVs of key factors associated with breast cancer development. Availability: An R package of PSSV is available at http://www.cbil.ece.vt.edu/software.htm. Contact: [email protected] Supplementary information: Supplementary data are available at Bioinformatics online.

1

Introduction

Genomic mutation analysis has been accelerated with the accumulation of DNA sequencing data acquired from exome regions to whole genome. Somatic mutations are likely to be critical factors determining how tumors progress and respond to treatments. To help optimize cancer therapy and predict long-term prognosis, it is important to identify somatic mutations. Structural variation (SV), a major type of genomic mutation (Feuk, et al., 2006), has been characterized in several studies (Freeman, et al., 2006; Yang, et al., 2013). Increasing volumes of whole genome sequencing (WGS) data have been generated from paired tumor-normal

samples, for example The Cancer Genome Atlas (TCGA) project. Thus, it is now possible to identify somatic SVs by comparing the sequence data of a tumor sample with that of its matched normal sample (Koboldt, et al., 2012; Malhotra, et al., 2013; Mills, et al., 2011). Nevertheless, the lack of powerful computational methods poses challenges to the accuracy of somatic SV identification. BreakDancer (Chen, et al., 2009) was first developed to predict basic types of SVs (i.e., deletion, insertion, inversion and translocation) using discordantly mapped reads from one or multiple WGS data samples. GASVPro (Sindi, et al., 2012) integrates read coverage and discordant reads for deletion and inversion prediction only. These two early SV detection methods were not developed for somatic SV prediction.

© The Author (2016). Published by Oxford University Press. All rights reserved. For Permissions, please email: [email protected]

Downloaded from http://bioinformatics.oxfordjournals.org/ at Simons Foundation on September 26, 2016

Bradley Department of Electrical and Computer Engineering, Virginia Polytechnic Institute and State 2 University, 900 North Glebe Road, Arlington, VA 22203, USA., Department of Oncology, Lombardi Comprehensive Cancer Center, Georgetown University Medical Center, 3970 Reservoir Road, Washington, DC 20057, USA.

than that of existing approaches, especially for predicting somatic SVs with homozygous status in the tumor sample and heterozygous status in the normal sample. We further applied PSSV to 14 TCGA breast cancer samples and compared the predicted somatic SVs to another independent study (Malhotra, et al., 2013) using a similar data set. After a detailed examination of the results of PSSV, we have not only identified several frequent somatic SVs (which are reported by another breast cancer study (Cancer Genome Atlas, 2012) using whole exome sequencing data), but have also observed some somatic SVs on a high fraction of Polycombgroup (PcG) genes. PcG genes can act as tumor suppressors, and their malfunction may trigger cancer progression (Di Croce and Helin, 2013). Further, for somatic deletions, after removing any deletions reported by 1000GP (Mills, et al., 2011) that occur in normal population, we observed functional enrichments on PI3K-AKT signaling, RAP1 signaling and RAS signaling pathways. By examining the expression change of each gene between tumor samples with and without mutations, we collected more evidence to support the causal relationship between somatic mutations and gene expression changes.

2

Methods

2.1 Overview of PSSV The workflow of the proposed PSSV approach is shown in Fig. 1. First, we identify a candidate SV pool from a pair of tumor and normal samples by hierarchical clustering of discordant reads in each. Types of SVs include deletion, insertion, inversion and translocations, as illustrated in Supplementary Fig. S1. The proportion of normal cells in the tumor sample is estimated for the removal of tumor sample impurity; the discordant and concordant read counts are calculated in the ‘purified’ tumor sample. For each SV, the concordant reads mapped around the mutation region are also counted. Then, each type of read count is assumed to follow a Poisson mixture distribution with three components respectively

Fig. 1. Flowchart of PSSV to identify somatic SVs from paired tumor-normal samples. PSSV features (i) read counts that are modeled by a mixture Poisson distribution with three components representing non-mutation, heterozygous and homozygous mutation statuses; (ii) modeling of each SV as a mixture of six hidden states representing different somatic and germline mutation patterns.

Downloaded from http://bioinformatics.oxfordjournals.org/ at Simons Foundation on September 26, 2016

CREST (Wang, et al., 2011) uses soft-clipped reads to identify breakpoints of SVs in tumor or normal samples and then predicts mutation regions unique in tumor samples as somatic SVs. However, not all SVs have soft-clipped reads hence CREST usually predicts a short list of somatic SVs. Delly (Rausch, et al., 2012) and Lumpy (Layer, et al., 2014) integrate soft-clipped reads and discordant reads for somatic SV prediction. They work well with small-scale SVs but cannot predict large-scale SVs like inter- or intra-chromosome translocations. Note that only BreakDancer can predict all basic types of SVs but it was not developed for somatic SV prediction. Existing somatic SV prediction tools predict SVs in tumor and normal samples independently and then use a filtering step to extract unique SVs in tumor samples as somatic SV prediction. There are at least three factors that lower the confidence of SVs predicted from such two-step approach. Firstly, it is well known that germline mutations significantly outnumber somatic mutations (Malhotra, et al., 2013). High sensitivity for somatic SV prediction requires accurate differentiation between somatic and germline SVs (Escaramis, et al., 2013) (Christoforides, et al., 2013), rather than a requirement only for somatic SVs. Secondly, the impurity of tumor samples increases the difficulty in differentiating somatic mutations from germline mutations. The contamination of normal cells should be removed before somatic region calling; otherwise, read counts calculated in tumor samples are not accurate. The proportion of normal cells can be estimated by THetA (Oesper, et al., 2013) using WGS data or tools using other types of data (Aran, et al., 2015). Finally, the mutation status on two copies of each chromosome is neglected by most SV detection methods (except for GASVPro and Delly). Cells with heterozygous mutations may function normally with the wild type copy until the latter becomes somatically mutated. Measuring somatic homozygous mutations is important in predicting the rate at which a person may develop cancer (Araten, et al., 2005). Note that the chromosome diploid feature has been utilized in computational modeling of SVs (Sindi, et al., 2012)), single nucleotide variant (Saunders, et al., 2012) and copy number variation (Klambauer, et al., 2012). We developed a pattern-based probabilistic approach (PSSV) for somatic SV prediction by jointly modeling discordant and concordant read counts (Chen, et al., 2009) from paired samples in a Bayesian framework. PSSV is specifically designed to predict somatic deletions, inversions, insertions and translocations by considering their different formation mechanisms. In detail, we define the ‘true’ mutation pattern (mutation situations in paired normal and tumor samples) of a SV as one of six states including three ‘somatic’, two ‘germline’ and one ‘none’ (as a special case of ‘germline’). Under each state, the discordant and concordant read counts are assumed to follow specific Poisson distributions. Note that the discordant and concordant read counts in tumor samples are properly normalized to remove normal cell contamination. Then, each SV identified from paired samples of individual patient is modeled as a mixture of hidden states. A Poisson mixture distribution is then introduced to each type of read count. Through an expectationmaximization (EM) algorithm, we iteratively estimate the prior probability for each state as well as the parameters of the Poisson mixture distribution. For an individual SV, we can then estimate a posterior probability vector representing the possibility that its mutation pattern can be explained by a specific state. Finally, we select a pattern with the largest probability for SV state prediction and then cluster all SVs into either somatic or germline catalogs. The proposed PSSV approach is first evaluated by simulation studies. Our simulation results show that PSSV is robust against the noise in WGS data caused by mapping errors and imperfect classification of discordant reads. The precision-recall performance of PSSV is better

representing non-mutation, heterozygous and homozygous mutation statuses. Non-mutation, heterozygous and homozygous mutation statuses in each sample can be modeled according to local alignments of discordant and concordant reads. Considering the mutation statuses in a pair of tumor and normal samples, patterns of somatic or germline SVs can be characterized by six hidden states. Using a Bayesian framework, PSSV analyzes discordant and concordant read counts jointly from paired samples, and iteratively estimates the prior probability and the Poisson distribution parameters for each hidden state, and the posterior probability for an individual SV using the EM algorithm. Finally, highly confident somatic SVs of (States 2, 3 and 5 defined in Fig. 1) are reported.

2.2 Candidate structural variation detection

2.3 PSSV model description After having identified candidate SVs from WGS data, for the n -th SV, we generate four read counts including k n , D ,T and kn,C ,T , discordant (D) and concordant (C) read counts in a ‘purified’ tumor (T) sample (after the removal of tumor sample impurity), and kn , D , N and k n ,C , N , discordant and concordant read counts in the matched normal (N) sample, as shown in Fig. 2. We use index m to denote each type of read and then define a read count vector k n = [ kn ,1 ,..., kn ,m ,..., kn ,4 ] . For SV detection in a single sample, an individual SV region is formed by non-mutation, heterozygous mutation or homozygous mutation. In the PSSV model, we model each type of read count with a Poisson mixture distribution of three components as:

where index i represents different mutation status in a single sample. More details about the Poisson mixture distributions for deletion, insertion and inversion can be found in Supplementary Figs. S3 – S6. As shown in Fig. 2, for a pair of samples, we jointly model four types of read counts stored in k n to identify somatic or germline SVs. The final mutation pattern is jointly determined by mutation statuses in paired samples. Here, we use index i (0~2) and j (0~2) to denote mutation statuses in tumor and normal samples respectively. Since somatic or germline mutations refer to genomic changes from normal to tumor samples, only those mutation states with i ≥ j are biologically meaningful. In total we have six states (State 1 ~ 6) as shown in Fig. 2. States with i > j , like States 2, 3 and 5, represent somatic mutations. States with i = j , like States 1, 4 and 6, represent germline mutations. Nonmutation ( i = j = 0 ) is a special case of germline mutation. The mutation pattern of the n -th SV is modeled by a mixture of six hidden states with a posterior probability vector a n = {an ,( i , j ) | 0 ≤ j ≤ i ≤ 2} . Each posterior probability an ,( i , j ) is defined by:

an ,( i , j ) = P (k n | i, j ) P (i, j ) /

∑ P(k

n

| i , j ) P (i , j ) ,

states. Somatic mutation occurs during the process of transition from normal cell to neoplastic cell. Only six states (combined pattern of mutation statuses in paired samples) are biological meaningful, including three ‘somatic’, two ‘germline’ and one ‘none’ (as a special case of ‘germline’).

(2)

j ,i ≥ j

where n is the index of SV, P(k n | i, j ) is the joint likelihood function for State (i, j ) , and P(i, j ) is the prior probability for State (i, j ) . For the likelihood function P(k n | i, j ) in (2), given the State (i, j ) , the distribution of each read type is conditionally independent. Therefore, P(k n | i, j ) can be calculated as:

P (k n | i , j ) = P ( k n ,1 | i ) P ( k n , 2 | i ) P ( k n ,3 | j ) P ( k n , 4 | j ) = Pois ( k n ,1 | λ1, i ) Pois ( k n , 2 | λ2, i )

.

(3)

Pois ( k n ,3 | λ3, j ) Pois ( k n ,4 | λ4, j ) λ as Here, we define a Poisson parameter matrix {λm=1(2),i , i = 0 ~ 2; λm=3(4), j , j = 0 ~ 2} . Selection of λ will determine the final performance for somatic SV detection. For deletion or insertion, the classification of discordant and concordant reads is based on the insert size distribution of their paired ends, where the value of the cut-off threshold will affect the number of read counts and the overall distribution of each type. Therefore, the actual value of λ is unknown and must be estimated. We assume a Gamma prior P(λ ) on each λ as shown in (4), which is a conjugate prior of the Poisson likelihood.  P (λm , i ) = Gamma(α m , i , β m , i ) m = 1, 2 ,   P (λm , j ) = Gamma (α m , j , β m, j ) m = 3, 4

Fig. 2. Relationship between the mutation status in each sample and six hidden

(1)

(4)

where mean value ( α / β ) of each Gamma distribution is determined by the whole genome average read depth (ARD) in the tumor or normal sample, and its mutation status; the variance ( α / β 2 ) is designed to control the scale of possible parameter values by avoiding severe overlap between the components defined in (1). For the prior probability P(i, j ) (denoted as π ( i , j ) ) in (2), we assume that π ( i , j ) follows a Dirichlet prior as shown in the following equation with modes (γ (i , j ) − 1) / ( ∑ γ (i , j ) − 6) : j ,i ≥ j

P(π) =

1 γ −1 ∑ π(i(,i ,jj)) , B(γ) j ,i ≥ j

(5)

Downloaded from http://bioinformatics.oxfordjournals.org/ at Simons Foundation on September 26, 2016

Using the BAM format WGS data from a pair of tumor-normal samples, we use a hierarchical clustering approach (Chen, et al., 2009) to generate discordant read clusters in each sample. We filter detected SVs by setting the number of discordant reads in a tumor sample ≥4. Then, we use GATK (McKenna, et al., 2010) to calculate the average concordant read depth within each candidate SV, and its corresponding flanks, in both samples. The concordant read depth within mutation regions can improve the sensitivity of somatic SV prediction. The read depth signals at both flanks can be used to normalize local read counts and eliminate GC bias and large scale copy number change, making the read counts comparable across all mutation regions. To remove normal cell contamination, the discordant and concordant read counts in the tumor sample are normalized using the estimated tumor sample impurity (Aran, et al., 2015; Oesper, et al., 2013). A flow chart of WGS data pre-processing for paired tumor-normal samples can be found in Supplementary Fig. S2.

 P ( k n , m | i = 0) = Pois(λm ,0 ), 'Non-mutation'   P ( k n , m | i = 1) = Pois(λm ,1 ), 'Heterozygous' ,   P ( k n , m | i = 2) = Pois(λm , 2 ), 'Homozygous'

where B(γ) is a constant normalization term, π = {π ( i , j ) | 0 ≤ j ≤ i ≤ 2} and γ = {γ ( i , j ) | 0 ≤ j ≤ i ≤ 2} .

2.4 Posterior probability estimation With the probabilistic model formulated, the problem of determining whether the n-th SV is a somatic or germline mutation can be mathematically stated as follows: given the model parameters ( λ ; π ) and read counts K = {k n | n = 1 ~ N } , how to estimate posterior probabilities {an | n = 1 ~ N } . Since λ and π are unknown, we define a posterior probability in the following equation to estimate these parameters: P(λ , π K )

(6)

∝ ∏  P (k n λ , π) P ( λ ) P ( π) n

In the following EM algorithm, we use the E-step and the M-step iteratively to update each parameter until the improvement of estimated posterior probability P(λ , π K ) is smaller than 10-4. Initial value setting of λ and γ can be found in Supplementary S4, Tables S1 – S4 for different SV subtypes. The EM algorithm can be summarized as follows: E-step: Based on the definition of an ,( i , j ) in (1), using Bayes’ rule, we can directly estimate a *n ,( i , j ) as follows:

a *n,(i , j ) =

π (i , j ) ∏ P(kn, m | i) P(λm,i ) ∏ P(kn, m | j ) P(λm, j ) m =1,2

∑ π ∏ P( k (i , j )

j ,i ≥ j

m =1,2

m = 3,4

n,m

| i) P(λm, i ) ∏ P(kn , m | j ) P(λm, j )

(7)

m = 3,4

M-Step: Considering the constraint ∑ π (i , j ) = 1 , we introduce a Lagrangian ,i≥ j parameter τ and maximize (6) jby estimating π(i, j) as follows:

π *( i , j ) =

 1  ∑ a *n ,( i , j ) + (γ ( i , j ) − 1)  

τ  j,i≥ j

(8)

where τ = N + ∑ (γ (i , j ) − 1) . j ,i> j λm ,i and λm , j by maximizing (6) with the updated We then update value of a *n ,( i , j ) as follows: i i  λ *m, i = ∑∑ a *n ,( i , j ) ( k n , m − 1 + α m , i ) / ∑∑ a *n ,( i , j ) (1 + β m ,i ) n j =0 n j =0   2 2 λ * = a *n ,( i , j ) ( k n , m − 1 + α m , j ) / ∑∑ a *n ,( i , j ) (1 + β m, j )  m, j ∑∑ n i= j n i= j

To evaluate PSSV’s performance on somatic SV prediction, especially when discordant and concordant reads are imperfectly classified, for each SV subtype (deletion, insertion, inversion or translocation), we simulated 100 SVs for each mutation state. Discordant and concordant read counts were generated according to the Poisson mixture distributions as shown in Supplementary Figs. S3 – S6. The first component represents non-mutation. The mean value is non-zero because there are always some reads false positively predicted as discordant reads. The mean value of this component would affect the detection accuracy of somatic mutation regions. To test the robustness of PSSV on differentiating somatic and germline mutation regions, we set the ARD in each sample as 30 and varied the mean parameter of the first component from 1 to 5 in order. We added Gaussian distributed noise to all three mutation statuses in either tumor or normal sample so as to mimic inaccuracy occurring at discordant read prediction. Taking deletion as an example, all discordant reads should have significantly longer insert sizes as compared with concordant reads. The insert size of concordant reads follows a Gaussian distribution. Thus, if a cut-off threshold is set at three standard deviations from the mean, some concordant reads would be falsely predicted as being discordant due to their long insert sizes. Also, some discordant reads with insert sizes less than the cut-off threshold will be missed. In real cases, the number of false discordant reads caused by prediction inaccuracy is lower than that caused by inaccurate alignments, although still cannot be overlooked due to their impact on the detection of all types of mutations. We simulated multiple data sets with different noise levels and applied PSSV to each data set to predict somatic SVs. The prediction performances on different SV subtypes are shown in Fig. 3. For deletion SV detection, the performance of PSSV is robust against noise even though the mean value of the first component of Poisson distribution is high (Fig. 3(a)). The impact of Gaussian noise is evident but PSSV provides a high accuracy even when the level of Gaussian noise is high. For insertion SV detection, only discordant read counts can be utilized since there is no concordant read coverage change at insertion regions. In addition, only partial discordant reads can be recorded because some discordant reads carrying insertion information cannot be successfully aligned to the reference genome. The accuracy of insertion detection degrades much faster when the level of noise, especially the Gaussian distributed noise, increases (Fig. 3(b)). However, if the discordant read count is accurate and the Gaussian distributed noise is small, PSSV can still achieve a high accuracy on insertion SV detection.

(9)

Note that in our experiments testing PSSV on both simulated and real data, the EM algorithm will usually converge within 20 iterations. A minimum number of 50 SVs in each state are needed to ensure the accuracy in parameter estimation. After convergence, for the n -th SV, we select the state with the highest a *n,(i , j ) to represent the mutation pattern of the region. If a *n,(i , j ) > 0.5 and i > j (States 2, 3 or 5), we will predict this SV as a somatic mutation. More details about the estimation of each posterior probability and each model parameter can be found in Supplementary S5.

(a)

(c)

3

Results

(b)

(d)

Fig. 3. Performance evaluation of PSSV on simulated data. (a), (b), (c) or (d) represents the accuracy of PSSV on detecting somatic deletions (a), somatic insertions (b)

somatic inversions (c) or somatic translocations (d) under different noise levels.

Downloaded from http://bioinformatics.oxfordjournals.org/ at Simons Foundation on September 26, 2016

    ∝ ∏  ∑ π ( i , j ) ∏ P ( k n , m | i ) P (λm , i ) ⋅ ∏ P ( k n , m | j ) P (λm , j )   P ( π) n  m =1,2 m = 3,4  j ,i ≥ j   

3.1 Performance evaluation by simulation studies

(a)

(b)

Supplementary Fig. S7. Then, we sequenced the genome using WGSIM (Li, et al., 2009). In this study, we simulated six different scenarios (Cases 1 - 6) by varying read coverage, read length and read insert size (mean and standard deviation). For each case, we generated three replicates by changing SV locations. Simulation parameters can be found in Supplementary Table S5. Briefly, Cases 1, 2 and 3 have different read coverage: 60, 30 and 15, respectively; Cases 2 and 4 have different read insert sizes: 350 bps and 200 bps, respectively; Cases 2, 5 and 6 have different read lengths: 100 bps, 75 bps and 150 bps, respectively. Sequenced reads were aligned to the reference genome using BWA (Li and Durbin, 2010). The PSSV approach and five competing methods (BreakDancer (Chen, et al., 2009), CREST (Wang, et al., 2011), Delly (Rausch, et al., 2012), GASVPro (Sindi, et al., 2012) and Lumpy (Layer, et al., 2014)) were applied to the simulated data for somatic SV detection. Since each method has its own output format, we performed proper conversions of the results of each method to make it comparable to PSSV (more details can be found in Supplementary S6). Fig. 4 shows the mean and standard deviation of the F-measure (2*precision*recall/(precision + recall)) for each method on detection of each type of somatic SVs. PSSV outperforms other competing methods consistently. From Figs. 4(a), 4(b) and 4(c) we can see that in general, a lower read coverage decreases the detection accuracy of SVs. From Figs. 4(b) and 4(d), we can see that most of the methods can work well with different read insert sizes and only the deletion detection performance of GASVPro drops significantly. By comparing the performances in Case 5 and 6 (Figs. 4(e) and 4(f)) we can note that a longer read length (Case 6) can significantly improve the detection performance. More detailed performances on each somatic state can be found in Supplementary Tables S6 – S11. Importantly, it can be found that PSSV has a significant performance improvement on the detection somatic SVs with State 5 (homozygous mutation in the tumor sample but heterozygous mutation in the normal sample).

3.2 Breast cancer WGS data analysis

(c)

(d)

(e)

(f) Fig. 4. F-measures of competing methods as tested on different simulation cases: Case 1 (a); Case 2 (b); Case 3 (c); Case 4 (d); Case 5 (e); Case 6 (f).

3.2.1 Identification of somatic SVs from TCGA breast cancer WGS data To demonstrate the capability of PSSV on real WGS data analysis, we analyzed 14 TCGA estrogen receptor negative (ER-) breast cancer patients with sample pairs. For each TCGA tumor sample, tumor sample impurity has been reported in (Aran, et al., 2015); the reported proportion of normal cells can be found in Supplementary Table S12. For other WGS datasets, tumor sample impurity can be estimated using THetA or other methods mentioned in (Aran, et al., 2015). We then applied PSSV to ‘purified’ tumor sample and normal sample and identified somatic SVs with State 2, 3, or 5 (posterior probability > 0.5). Taking one paired TCGA sample as an example, in Supplementary Figs. S8 – S11 we show that PSSV can predict candidate SVs in terms of their different states and the distribution of discordant read count for each state is correctly estimated. Specific to each somatic state, the discordant read counts were significantly different between tumor and normal samples. We then applied PSSV to the remaining 13 paired samples and predicted somatic SVs on each. In total, we identified 3,876 deletions, 6,672 insertions, 4,003 inversions and 3,862 translocations with State 2; 601 deletions, 165 insertions, 435 inversions and 188 translocations with State 3; 469 deletions, 95 insertions and 170 translocations with State 5. As shown in Fig. 5, about 1070 of deletions, 260 of insertions, 435 of inversions and 358 of translocations have homozygous mutation pattern (States 3 and

Downloaded from http://bioinformatics.oxfordjournals.org/ at Simons Foundation on September 26, 2016

For inversion detection, as shown in Fig. 3(c), PSSV achieved robust performance under different noise conditions. The impact of Gaussian noise is much lower than that on deletion or insertion SV detection because the insert size is not used to define inverted discordant reads, whereas the orientation of two ends of a read is considered. For chromosome translocation SV detection, the performance is quite robust (Fig. 3(d) because two ends of each read used to predict translocation are either mapped to two distant locations on the same chromosome or to two different chromosomes. The ambiguity of such discordant reads is lower than that of the other SV types. To compare PSSV with existing somatic SV prediction approaches, we used a SV simulation tool, RSVSIM (Bartenhagen and Dugas, 2013), to simulate a pair of genomes as ‘tumor’ and ‘normal’ using a segment of human reference genome (GRCh37). We implanted 100 germline and another 100 somatic mutations for each state of deletions, insertions, inversions or chromosome translocations into the reference genome. Mutation length of each SV is randomly selected from the real length distribution of SV detected from TCGA tumor samples, as shown in

(b)

(c)

(d)

Fig. 5. Summary of somatic SVs detected from the TCGA breast cancer WGS data. Proportion of PSSV-detected somatic deletions with each somatic state (a); proportion of PSSV-detected somatic insertions with each somatic state (b); proportion of PSSVdetected somatic inversions with each somatic state (c); proportion of PSSV-detected somatic translocations with each somatic state (d).

5). More detailed results for individual tumor samples can be found in Supplementary Table S13. To estimate the false discovery rate (FDR) of somatic SV predictions, for each pair we flipped tumor and normal samples and predicted ‘somatic’ regions in the normal sample using PSSV (Malhotra, et al., 2013). As shown in Supplementary Table S14, most somatic predictions have a FDR < 0.10. The length distributions of PSSV-predicted somatic SVs are shown in Supplementary Fig. S12. For insertion, due to using discordant reads with significantly shorter insert size (Chen, et al., 2009), we only predict insertions up to 100 bps.

3.2.1

Somatic SV validation

To validate somatic SVs predicted by PSSV, we compared the results of PSSV with that from another independent study on nine TCGA breast cancer samples (Malhotra, et al., 2013), in which deletions, inversions and translocations were computationally validated. The somatic SV prediction process in that study used multiple filtering steps and compared against SVs predicted from other tissues; the final set of validated somatic SVs includes 385 deletions, 42 inversions and 293 chromosome translocations. In addition, the authors provided about 6,000 germline SVs that can be used to estimate the false positive rate of somatic predictions. As a result, in total we have validated 121 somatic deletions, 11 somatic inversions and 33 translocations; the false positive rate of PSSVpredicted somatic SVs in each sample is less than 0.01. The number of validated somatic SVs and the false positive rate of individual samples are summarized in Supplementary Table S15. We proceeded to investigate the PSSV-identified genes in available

3.2.2

3.2.3

(a)

(b)

Fig. 6. Detected somatic SVs in breast cancer specific genes and in Polycomb genes. ‘Red’ color represents the posterior probability reported by PSSV for each somatic SV in

each sample.

Refining somatic SVs by excluding population variants

Due to the limited number of breast cancer tumor samples used in this study, some predictions of PSSV may not be breast cancer specific. The Structure Variation Analysis Group (SVAG) identified 22,025 deletions from 179 unrelated individuals in 1000GP (Mills, et al., 2011). If a breast cancer somatic SV matches a variant identified from this normal human population, such calls are more likely to be non-somatic, or non-tumor related. We compared 4,946 PSSV predicted somatic deletions with any known variants in the normal human population to exclude spurious somatic calls. There are 4,705 (~95%) somatic deletions without any overlap with normal deletions. We are interested in somatic SVs that lay in functional genomic regions like promoter and coding regions. Therefore, we extracted gene coding regions from UCSC human reference genome hg19. For each gene, its upstream and downstream 10k bps from the transcription starting site is defined as promoter region. We finally found 1,090 genes with somatic deletions at promoter or coding regions. Pathway enrichment results of these somatic genes using DAVID (Huang, et al., 2009) can be found in Supplementary Table S17. Enrichment of PI3K-AKT signaling pathway is quite consistent to the previous mutation observation on PIK3CA. It may be interesting to find that RAP1 signaling and RAS signaling pathways are enriched here. RAP1 activation has been found to associate with breast cancer metastasis progression (McSherry, et al., 2011), which is more possible to occur among ER- breast cancer patients. Targeting RAP1 signaling could provide a means to control metastasis of breast cancer. It was previously reported that RAS mutations are rare in breast cancer(Bamford, et al., 2004), but alterations in upstream regulators and downstream effectors activate RAS pathway frequently (Downward, 2003). In this study, we have identified breast cancer somatic mutations on 21 genes involved in RAS signaling pathway.

Impact of somatic SVs on gene expression pattern

mRNA abundance provides a quantitative evaluation of gene expression and RNA-seq data can be used to confirm genomic mutations (Network, 2012). We downloaded the RNA-seq data for the previously selected 14 TCGA tumor samples. There are 965 genes exhibiting somatic SVs (including deletions, insertions, and inversions) in at least two samples (Supplementary Fig. S13). We extracted their gene expression data (level

Downloaded from http://bioinformatics.oxfordjournals.org/ at Simons Foundation on September 26, 2016

(a)

databases to check whether PSSV had captured somatic SVs occurring at any known breast cancer specific genes. In a previous study (Koboldt, et al., 2012; Mills, et al., 2011), researchers reported 20 breast cancer related genes with somatic copy number variations using whole exome sequencing data of 507 TCGA breast cancer patients, including all individuals used in our analysis. PSSV successfully captured 10 genes. We further compared our results with available literature and identified another six breast cancer related genes. The posterior probabilities of the somatic SVs occurring at these sixteen breast cancer specific genes are shown in Fig. 6(a). We found that MLL3 and PIK3CA, with frequent mutations in multiple types of human tumors including breast cancer (Karakas, et al., 2006; Wang, et al., 2011), contained somatic deletions, insertions, or inversions in multiple samples. It has been also known that somatic genetic alteration of the Polycomb-group (PcG) genes predisposes normal cells to various types of disease (Di Croce and Helin, 2013). PcG genes are known tumor suppressor genes, whose target genes have been demonstrated to be predictive for breast cancer prognosis (Jene-Sanz, et al., 2013). Among the 30 PcG genes reported in (Di Croce and Helin, 2013), PSSV identified 13 genes with somatic SVs, of which the posterior probabilities are shown in Fig. 6(b).

3) from all downloaded RNA-seq profiles. For each candidate gene, we compared expression levels between the samples with and without somatic SVs. Finally, in total 184 (~20%) differential genes were selected with fold changes > 2; the gene list can be found in Supplementary Table S18. Functional enrichment analysis using DAVID shows that MAPK signaling pathway and p53 signaling pathway are enriched, respectively. In the KEGG pathway database, these two signaling pathways are closely linked with PI3K-AKT signaling pathway.

Discussion

Funding This work was supported in part by the National Institutes of Health (CA149653 to J.X., CA149147 & CA184902 to R.C., and CA164384 to L.H.-C.)

References Aran, D., Sirota, M. and Butte, A.J. Systematic pan-cancer analysis of tumour purity. Nature communications 2015;6:8971. Araten, D.J., et al. A quantitative measurement of the human somatic mutation rate. Cancer Res 2005;65(18):8111-8117. Bamford, S., et al. The COSMIC (Catalogue of Somatic Mutations in Cancer) database and website. British journal of cancer 2004;91(2):355-358. Bartenhagen, C. and Dugas, M. RSVSim: an R/Bioconductor package for the simulation of structural variations. Bioinformatics 2013;29(13):1679-1681. Cancer Genome Atlas, N. Comprehensive molecular portraits of human breast tumours. Nature 2012;490(7418):61-70. Chen, K., et al. BreakDancer: an algorithm for high-resolution mapping of genomic structural variation. Nature methods 2009;6(9):677-681. Christoforides, A., et al. Identification of somatic mutations in cancer through Bayesian-based analysis of sequenced genome pairs. BMC genomics 2013;14:302. Cibulskis, K., et al. Sensitive detection of somatic point mutations in impure and heterogeneous cancer samples. Nat Biotechnol 2013;31(3):213-219. Di Croce, L. and Helin, K. Transcriptional regulation by Polycomb group proteins. Nature structural & molecular biology 2013;20(10):1147-1155. Downward, J. Targeting RAS signalling pathways in cancer therapy. Nature reviews. Cancer 2003;3(1):11-22. Escaramis, G., et al. PeSV-Fisher: identification of somatic and non-somatic structural variants using next generation sequencing data. PloS one 2013;8(5):e63377. Feuk, L., Carson, A.R. and Scherer, S.W. Structural variation in the human genome. Nat Rev Genet 2006;7(2):85-97. Freeman, J.L., et al. Copy number variation: new insights in genome diversity. Genome Res 2006;16(8):949-961. Huang, d.W., Sherman, B.T. and Lempicki, R.A. Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nat Protoc 2009;4(1):44-57. Jene-Sanz, A., et al. Expression of polycomb targets predicts breast cancer prognosis. Mol Cell Biol 2013;33(19):3951-3961. Karakas, B., Bachman, K.E. and Park, B.H. Mutation of the PIK3CA oncogene in human cancers. British journal of cancer 2006;94(4):455-459. Klambauer, G., et al. cn.MOPS: mixture of Poissons for discovering copy number variations in next-generation sequencing data with a low false discovery rate. Nucleic Acids Res 2012;40(9):e69. Koboldt, D.C., et al. Comprehensive molecular portraits of human breast tumours. Nature 2012;490(7418):61-70. Layer, R.M., et al. LUMPY: a probabilistic framework for structural variant discovery. Genome biology 2014;15(6):R84. Li, H. and Durbin, R. Fast and accurate long-read alignment with Burrows-Wheeler transform. Bioinformatics 2010;26(5):589-595. Li, H., et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics 2009;25(16):2078-2079. Malhotra, A., et al. Breakpoint profiling of 64 cancer genomes reveals numerous complex rearrangements spawned by homology-independent mechanisms. Genome research 2013;23(5):762-776. McKenna, A., et al. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res 2010;20(9):1297-1303. McSherry, E.A., et al. Breast cancer cell migration is regulated through junctional adhesion molecule-A-mediated activation of Rap1 GTPase. Breast cancer research : BCR 2011;13(2):R31. Mills, R.E., et al. Mapping copy number variation by population-scale genome sequencing. Nature 2011;470(7332):59-65. Network, C.G.A.R. Comprehensive genomic characterization of squamous cell lung cancers. Nature 2012;489(7417):519-525. Oesper, L., Mahmoody, A. and Raphael, B.J. THetA: inferring intra-tumor heterogeneity from high-throughput DNA sequencing data. Genome biology 2013;14(7):R80. Rausch, T., et al. DELLY: structural variant discovery by integrated paired-end and split-read analysis. Bioinformatics 2012;28(18):i333-i339. Saunders, C.T., et al. Strelka: accurate somatic small-variant calling from sequenced tumor-normal sample pairs. Bioinformatics 2012;28(14):1811-1817. Sindi, S.S., et al. An integrative probabilistic model for identification of structural variation in sequencing data. Genome biology 2012;13(3):R22. Wang, J., et al. CREST maps somatic structural variation in cancer genomes with base-pair resolution. Nature methods 2011;8(8):652-654. Wang, X.X., et al. Somatic mutations of the mixed-lineage leukemia 3 (MLL3) gene in primary breast cancers. Pathol Oncol Res 2011;17(2):429-433. Yang, L., et al. Diverse mechanisms of somatic structural variations in human cancer genomes. Cell 2013;153(4):919-929.

Downloaded from http://bioinformatics.oxfordjournals.org/ at Simons Foundation on September 26, 2016

A pattern based probabilistic model (PSSV) has been developed for somatic SV prediction using paired tumor-normal WGS data. Based on both simulation and real TCGA breast cancer WGS data studies, we have demonstrated that PSSV can predict highly confident somatic SVs using a probabilistic model with multiple hidden states. Model parameters are learned from the input data during the EM iteration procedure of PSSV. Intra-tumor heterogeneity presents a major challenge for DNA-seq data analysis. The number of sub-clones, population size and mutation status of each sub-clone will affect the accuracy of somatic mutation prediction (Cibulskis, et al., 2013), especially on heterozygous and homozygous state prediction of using PSSV. If somatic mutation does occur in one or several sub-clones and the population size is large, PSSV will predict this mutation as somatic because its discordant read observation is higher than that in the matched normal sample. The sum of the posterior probabilities of the three somatic states defined in PSSV should be larger than the sum of the probabilities of germline states. However, in some cases it would be hard to claim which state can best represent current mutation pattern because the posterior probability calculated by PSSV for any one out of the three somatic states may not be significantly higher than that of the rest two. The posterior probability of each SV should be examined before a final prediction can be made. For tumor samples with very complex intra-tumor heterogeneity, we recommend using PSSV to only predict somatic status rather than using a specific state to denote the mutation pattern. More discussions on the effects of intra-tumor heterogeneity and large-scale copy number change can be found in Supplementary Material S8. As noted before, using discordant reads PSSV can only identify short insertions. Reads around insertion regions cannot be fully mapped to the reference genome because a certain segment is unknown. While some methods use a re-alignment technique to collect more reads carrying insertion signals, this problem is still not fully solved. Reads with longer insert size and high read coverage are needed to overcome the current bottleneck of insertion SV detection. Besides the four basic types mentioned previously, SVs also include some other complex patterns, e.g., tandem duplication. Tandem duplication (with mate reads mapping out of order but with proper orientations) is different from copy number gain, although it also results in a significant increase of read coverage. Since an SV of tandem duplication may contain one, two or more duplications, it cannot be modeled by the heterozygous/homozygous hypothesis, nor can it be predicted by the current framework of PSSV. In the future, we may extend PSSV for tandem duplication prediction using multiple Poisson components to fit the discordant read (or read coverage) distributions of multiple duplications of each region.

Conflict of Interest: none declared.

Suggest Documents