CLImAT: accurate detection of copy number alteration and loss of heterozygosity in impure and aneuploid tumor samples using whole-genome sequencing data

Supplementary Material

Zhenhua Yu, Yuanning Liu, Yi Shen, Minghui Wang and Ao Li

Contents
1. Supplementary Methods
1.1. Quantile normalization of BAF
1.2. Integrated Hidden Markov Model
1.2.1. Hidden state space
1.2.2. Emission model
1.2.3. Parameter estimation
1.3. Post-processing procedure for copy number annotation
1.4. Reliability score for aberration detection
1.5. Performance evaluation
1.6. Tumor-normal admixture experiment
1.7. Details of the investigated methods
1.7.1. ABSOLUTE
1.7.2. SNVMix
1.7.3. FREEC
1.7.4. Patchwork
1.7.5. CLImAT
2. Supplementary Figures
3. Supplementary Tables

1. Supplementary Methods
1.1. Quantile normalization of BAF

In the analysis of next-generation sequencing data, a loss of reads (LOR) issue arises in the alignment step, when reads are mapped to the reference genome (Kim, et al., 2013). Existing aligners, such as BioScope and BWA (Li and Durbin, 2009), preferentially align reads carrying the reference allele over reads carrying the alternative allele, so BAF plots are positioned asymmetrically around 0.5. We use the tQN method proposed by Staaf, et al. (2008) to normalize BAF signals. Let the vectors $b_{1:N}$ and $T_{1:N}$ denote the number of reads aligned to the non-reference base and the total number of reads at N SNP positions, respectively. BAF is defined as:

$$\mathrm{BAF}_i = \frac{b_i}{T_i}, \quad i = 1, 2, \ldots, N \tag{1}$$

The reference allele frequency (RAF) can be easily calculated as:

$$\mathrm{RAF}_i = \frac{T_i - b_i}{T_i}, \quad i = 1, 2, \ldots, N \tag{2}$$
The BAF and RAF signals should theoretically follow the same distribution, which is the core idea of quantile normalization. A detailed description of quantile normalization (QN) can be found in (Bolstad, et al., 2003). A threshold of 0.9 is used in this study for the tQN of BAF and RAF. Let NB denote the quantile normalized B allele frequency. Suppose that, for the ith SNP position, $d_i$ reads sequenced from the alternative chromosome were wrongly discarded by the aligner. The corrected $\tilde{b}_{1:N}$ and $\tilde{T}_{1:N}$ can then be written as:

$$\tilde{b}_i = b_i + d_i, \quad i = 1, 2, \ldots, N \tag{3}$$

$$\tilde{T}_i = T_i + d_i, \quad i = 1, 2, \ldots, N \tag{4}$$

The relation between $d_i$ and $NB_i$ follows by requiring the corrected counts to reproduce the normalized frequency, and solving for $d_i$ gives Equation (6):

$$\frac{b_i + d_i}{T_i + d_i} = NB_i, \quad i = 1, 2, \ldots, N \tag{5}$$

$$d_i = \mathrm{round}\left(\frac{NB_i\, T_i - b_i}{1 - NB_i}\right), \quad i = 1, 2, \ldots, N \tag{6}$$
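The read-count correction in Equations (3) to (6) can be illustrated with a minimal Python sketch. The function name `correct_read_counts` is hypothetical, and two behaviours are assumptions not stated in the text: negative values of d_i are clamped to zero, and positions with NB_i = 1 (where Equation (6) is undefined) are left unchanged.

```python
def correct_read_counts(b, T, nb):
    """Return (b_corrected, T_corrected, d) per SNP, following Equations (3)-(6).

    b:  B allele read counts
    T:  total read counts
    nb: quantile normalized B allele frequencies (NB)
    """
    corrected = []
    for b_i, T_i, nb_i in zip(b, T, nb):
        if nb_i >= 1.0:
            # Assumption: Equation (6) is undefined at NB_i = 1, so no reads are added.
            d_i = 0
        else:
            d_i = round((nb_i * T_i - b_i) / (1.0 - nb_i))  # Equation (6)
        # Assumption: a negative d_i is not meaningful as a count of lost reads.
        d_i = max(d_i, 0)
        corrected.append((b_i + d_i, T_i + d_i, d_i))  # Equations (3) and (4)
    return corrected
```

For a SNP with 40 B allele reads out of 100 and NB = 0.5, the sketch adds 20 reads, so the corrected counts are 60 of 120 and the corrected BAF equals NB.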
1.2. Integrated Hidden Markov Model

1.2.1. Hidden state space

The hidden state space consists of 20 states with different copy number status and tumor genotype (Table S1). Copy numbers up to 7 are considered in our model. Each state in the HMM corresponds to a copy number status. For example, state 1 is the deletion of two copies, state 5 is a copy number of three with duplication of one allele, and state 8 is a copy number of four with equal duplication of both alleles.

1.2.2. Emission model

The probability distribution functions of read count data and RD data can be found in the main text. We incorporate the effect of signal fluctuation in the emission models of CLImAT, and in this case a uniform distribution is employed to approximate the statistical distributions of B allele read count and RD:

$$p(b_i \mid N_i, c = 0) = \begin{cases} \dfrac{1}{N_i + 1}, & 0 \le b_i \le N_i \\ 0, & \text{otherwise} \end{cases} \tag{7}$$

$$p(d_i \mid c = 0) = \begin{cases} \dfrac{1}{N + 1}, & 0 \le d_i \le N \\ 0, & \text{otherwise} \end{cases} \tag{8}$$
1.2.3. Parameter estimation

We employ the expectation maximization (EM) algorithm for parameter estimation. In the expectation step, we use only the heterozygous tumor genotypes included in each state to calculate the expectation of the partial log-likelihood of BAF, which is formulated as:

$$
\begin{aligned}
E(LL_b) &= \sum_{i=1}^{N}\sum_{c=1}^{C} \gamma_i(c)\,\log p(b_i \mid w_s, N_i, c) \\
&= \sum_{i=1}^{N}\sum_{c=1}^{C} \gamma_i(c)\left[\log\binom{N_i}{b_i} + b_i \log\frac{z_c}{y_c} + (N_i - b_i)\log\left(1 - \frac{z_c}{y_c}\right)\right]
\end{aligned}
\tag{9}
$$
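The forward-backward algorithm (Rabiner, 1989) is used to calculate the posterior probability $\gamma_i(c)$ that the ith SNP is in state c. Similarly, the expectation of the partial log-likelihood function of RD data is formulated as:

$$
\begin{aligned}
E(LL_d) &= \sum_{i=1}^{N}\sum_{c=1}^{C} \gamma_i(c)\,\log p(d_i \mid w_s, o, \lambda, p_c, c) \\
&= \sum_{i=1}^{N}\sum_{c=1}^{C} \gamma_i(c)\left[\log\Gamma\!\left(d_i + \frac{\lambda_c(1-p_c)}{p_c}\right) + d_i \log p_c + \frac{\lambda_c(1-p_c)}{p_c}\log(1-p_c) - \log\Gamma\!\left(\frac{\lambda_c(1-p_c)}{p_c}\right) - \log\Gamma(d_i + 1)\right]
\end{aligned}
\tag{10}
$$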
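As an illustration of the two per-SNP log-likelihood terms that are weighted by the posterior probabilities in the expectation step, the following Python sketch evaluates the binomial term for B allele read counts and the negative binomial term for RD. The function names are hypothetical. The negative binomial is parameterized by a mean `lam_c` and a probability `p_c`, with size parameter lam_c(1 - p_c)/p_c, which is the form read from the RD log-likelihood above; treat this as a sketch under that assumption rather than the CLImAT implementation.

```python
import math


def log_binomial_pmf(b, n, q):
    """log C(n, b) + b*log(q) + (n - b)*log(1 - q), with q = z_c / y_c in (0, 1)."""
    log_choose = math.lgamma(n + 1) - math.lgamma(b + 1) - math.lgamma(n - b + 1)
    return log_choose + b * math.log(q) + (n - b) * math.log(1.0 - q)


def log_negbin_pmf(d, lam_c, p_c):
    """Negative binomial log-probability of read depth d with mean lam_c."""
    r = lam_c * (1.0 - p_c) / p_c  # size parameter; the mean r*p/(1-p) equals lam_c
    return (math.lgamma(d + r) - math.lgamma(r) - math.lgamma(d + 1)
            + d * math.log(p_c) + r * math.log(1.0 - p_c))
```

Working in log space with `math.lgamma` avoids overflow of the gamma function for the large read depths typical of whole-genome data.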
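In the maximization step of the EM algorithm, we use the Newton algorithm to update all model parameters. The first and second partial derivatives of $E(LL_b)$ and $E(LL_d)$ with respect to each of the involved parameters are derived, and the update process for all parameters is shown below. To estimate the tumor impurity level $w_s$, we use both BAF and RD signals.

$$\frac{\partial E(LL_b)}{\partial w_s} = \sum_{i=1}^{N}\sum_{c=1}^{C} \gamma_i(c)\,\frac{(\mu_s - \mu_c)\, n_s n_c}{y_c}\left[\frac{b_i}{z_c} + \frac{N_i - b_i}{z_c - y_c}\right] \tag{11}$$

$$
\begin{aligned}
\frac{\partial^2 E(LL_b)}{\partial w_s^2} = \sum_{i=1}^{N}\sum_{c=1}^{C} \gamma_i(c)\,\frac{(\mu_s - \mu_c)\, n_s n_c}{y_c}\Bigg[ & (n_c\mu_c - n_s\mu_s)\frac{b_i}{z_c^2} - (n_s\mu_s - n_c\mu_c - n_s + n_c)\frac{N_i - b_i}{(z_c - y_c)^2} \\
& - \frac{n_s - n_c}{y_c}\left(\frac{b_i}{z_c} + \frac{N_i - b_i}{z_c - y_c}\right)\Bigg]
\end{aligned}
\tag{12}
$$

$$\frac{\partial E(LL_d)}{\partial w_s} = \sum_{i=1}^{N}\sum_{c=1}^{C} \gamma_i(c)\,\frac{\lambda (n_s - n_c)(1-p_c)}{2p_c}\left[\psi\!\left(d_i + \frac{\lambda_c(1-p_c)}{p_c}\right) + \log(1-p_c) - \psi\!\left(\frac{\lambda_c(1-p_c)}{p_c}\right)\right] \tag{13}$$

$$\frac{\partial^2 E(LL_d)}{\partial w_s^2} = \sum_{i=1}^{N}\sum_{c=1}^{C} \gamma_i(c)\left[\frac{\lambda (n_s - n_c)(1-p_c)}{2p_c}\right]^2\left[\psi'\!\left(d_i + \frac{\lambda_c(1-p_c)}{p_c}\right) - \psi'\!\left(\frac{\lambda_c(1-p_c)}{p_c}\right)\right] \tag{14}$$

Here the function $\psi$, also known as the digamma function, is the logarithmic derivative of the gamma function. We update the parameter $w_s$ for the next iteration using the following formula:

$$w_{s,n+1} = w_{s,n} - \frac{\dfrac{\partial E(LL_b)}{\partial w_s} + \dfrac{\partial E(LL_d)}{\partial w_s}}{\dfrac{\partial^2 E(LL_b)}{\partial w_s^2} + \dfrac{\partial^2 E(LL_d)}{\partial w_s^2}} \tag{15}$$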
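The Newton update for the tumor impurity level can be sketched in Python as follows. For brevity the sketch covers only the BAF contribution (the first and second derivatives of the BAF log-likelihood); the RD contribution would be added to both derivatives before the step is taken. The function names, the argument layout, and the defaults `n_s = 2` and `mu_s = 0.5` for the normal cells are assumptions made for illustration, with y_c = n_s w_s + n_c (1 - w_s) and z_c = n_s mu_s w_s + n_c mu_c (1 - w_s).

```python
def baf_derivatives(w_s, snps, gamma, states, n_s=2, mu_s=0.5):
    """First and second derivatives of the expected BAF log-likelihood w.r.t. w_s.

    snps:   list of (b_i, N_i) pairs
    gamma:  gamma[i][c], posterior probability of state c at SNP i
    states: list of (n_c, mu_c) pairs, copy number and expected BAF per state
    """
    g1 = g2 = 0.0
    for (b, n), post in zip(snps, gamma):
        for (n_c, mu_c), g in zip(states, post):
            y = n_s * w_s + n_c * (1.0 - w_s)
            z = n_s * mu_s * w_s + n_c * mu_c * (1.0 - w_s)
            a = (mu_s - mu_c) * n_s * n_c / y
            core = b / z + (n - b) / (z - y)
            g1 += g * a * core
            g2 += g * a * (
                (n_c * mu_c - n_s * mu_s) * b / z ** 2
                - (n_s * mu_s - n_c * mu_c - n_s + n_c) * (n - b) / (z - y) ** 2
                - (n_s - n_c) / y * core
            )
    return g1, g2


def newton_step(w_s, first, second):
    """One Newton update: w_new = w_old - first / second."""
    return w_s - first / second
```

A practical implementation would probably also keep the updated value inside the open interval (0, 1); that safeguard is not described in the text and is omitted here.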
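Next, we use the updated parameter $w_s$ to further update the other parameters. The update process for the copy neutral read count $\lambda$ is as follows:

$$\frac{\partial E(LL_d)}{\partial \lambda} = \sum_{i=1}^{N}\sum_{c=1}^{C} \gamma_i(c)\,\frac{y_c(1-p_c)}{2p_c}\left[\psi\!\left(d_i + \frac{\lambda_c(1-p_c)}{p_c}\right) + \log(1-p_c) - \psi\!\left(\frac{\lambda_c(1-p_c)}{p_c}\right)\right] \tag{16}$$

$$\frac{\partial^2 E(LL_d)}{\partial \lambda^2} = \sum_{i=1}^{N}\sum_{c=1}^{C} \gamma_i(c)\left[\frac{y_c(1-p_c)}{2p_c}\right]^2\left[\psi'\!\left(d_i + \frac{\lambda_c(1-p_c)}{p_c}\right) - \psi'\!\left(\frac{\lambda_c(1-p_c)}{p_c}\right)\right] \tag{17}$$

$$\lambda_{n+1} = \lambda_n - \frac{\partial E(LL_d)/\partial \lambda}{\partial^2 E(LL_d)/\partial \lambda^2} \tag{18}$$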
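Once $\lambda$ is updated, the parameters $o$ and $p_c$ are updated in the same way.

$$\frac{\partial E(LL_d)}{\partial o} = \sum_{i=1}^{N}\sum_{c=1}^{C} \gamma_i(c)\,\frac{1-p_c}{p_c}\left[\psi\!\left(d_i + \frac{\lambda_c(1-p_c)}{p_c}\right) + \log(1-p_c) - \psi\!\left(\frac{\lambda_c(1-p_c)}{p_c}\right)\right] \tag{19}$$

$$\frac{\partial^2 E(LL_d)}{\partial o^2} = \sum_{i=1}^{N}\sum_{c=1}^{C} \gamma_i(c)\left[\frac{1-p_c}{p_c}\right]^2\left[\psi'\!\left(d_i + \frac{\lambda_c(1-p_c)}{p_c}\right) - \psi'\!\left(\frac{\lambda_c(1-p_c)}{p_c}\right)\right] \tag{20}$$

$$o_{n+1} = o_n - \frac{\partial E(LL_d)/\partial o}{\partial^2 E(LL_d)/\partial o^2} \tag{21}$$

$$\frac{\partial E(LL_d)}{\partial p_c} = \sum_{i=1}^{N} \gamma_i(c)\left\{-\frac{\lambda_c}{p_c^2}\left[\psi\!\left(d_i + \frac{\lambda_c(1-p_c)}{p_c}\right) + \log(1-p_c) - \psi\!\left(\frac{\lambda_c(1-p_c)}{p_c}\right)\right] + \frac{d_i - \lambda_c}{p_c}\right\} \tag{22}$$

$$
\begin{aligned}
\frac{\partial^2 E(LL_d)}{\partial p_c^2} = \sum_{i=1}^{N} \gamma_i(c)\Bigg\{ & \frac{2\lambda_c}{p_c^3}\left[\psi\!\left(d_i + \frac{\lambda_c(1-p_c)}{p_c}\right) + \log(1-p_c) - \psi\!\left(\frac{\lambda_c(1-p_c)}{p_c}\right)\right] \\
& + \frac{\lambda_c}{p_c^2}\left[\frac{\lambda_c}{p_c^2}\,\psi'\!\left(d_i + \frac{\lambda_c(1-p_c)}{p_c}\right) + \frac{1}{1-p_c} - \frac{\lambda_c}{p_c^2}\,\psi'\!\left(\frac{\lambda_c(1-p_c)}{p_c}\right)\right] - \frac{d_i - \lambda_c}{p_c^2}\Bigg\}
\end{aligned}
\tag{23}
$$

$$p_{c,n+1} = p_{c,n} - \frac{\partial E(LL_d)/\partial p_c}{\partial^2 E(LL_d)/\partial p_c^2} \tag{24}$$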
We use the approach discussed in (Rabiner, 1989) to estimate the initial state distribution $\pi$ and the state transition matrix $A$. The EM algorithm for CLImAT is implemented as follows: (1) start with initial parameters $\pi^0, A^0, w_s^0, \lambda^0, o^0, p^0$ and calculate the posterior probability $\gamma_i^1$ using the standard forward-backward algorithm, (2) update $\pi^1, A^1, w_s^1, \lambda^1, o^1, p^1$ using the aforementioned method, (3) repeat steps (1) and (2) until the algorithm converges. The parameters from the last iteration of the training process are reported as the optimal estimates. At the same time, the copy number, tumor genotype and zygosity state of each SNP are inferred from the hidden state with the largest posterior probability. Moreover, we perform a simple grid search over these parameters to find good initial values, which are necessary for model training.
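Step (1) of the EM procedure relies on the forward-backward algorithm to obtain the posterior state probabilities. The following Python sketch shows a generic scaled forward-backward pass in the style of Rabiner (1989). It is not the CLImAT code: the function name and argument layout are hypothetical, and the emission probabilities are assumed to be precomputed per position and state.

```python
def forward_backward(pi, A, emis):
    """Posterior state probabilities gamma[i][c] for a discrete-state HMM.

    pi:   initial state distribution, length C
    A:    C x C transition matrix, A[j][k] = P(state k at i+1 | state j at i)
    emis: emis[i][c], emission probability of observation i under state c
    """
    n, C = len(emis), len(pi)
    alpha = [[0.0] * C for _ in range(n)]
    beta = [[1.0] * C for _ in range(n)]
    scale = [0.0] * n
    # Forward pass, rescaled at every position to avoid numerical underflow.
    for i in range(n):
        for k in range(C):
            if i == 0:
                prior = pi[k]
            else:
                prior = sum(alpha[i - 1][j] * A[j][k] for j in range(C))
            alpha[i][k] = prior * emis[i][k]
        scale[i] = sum(alpha[i])
        alpha[i] = [a / scale[i] for a in alpha[i]]
    # Backward pass, reusing the scaling factors from the forward pass.
    for i in range(n - 2, -1, -1):
        for j in range(C):
            total = sum(A[j][k] * emis[i + 1][k] * beta[i + 1][k] for k in range(C))
            beta[i][j] = total / scale[i + 1]
    # With this scaling the product alpha * beta is already normalized per position.
    return [[alpha[i][c] * beta[i][c] for c in range(C)] for i in range(n)]
```

The most probable state at position i is then the index of the largest entry of `gamma[i]`, which corresponds to the rule used above to infer copy number, genotype and zygosity.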
1.3. Post-processing procedure for copy number annotation

Due to the limited number of hidden states, the maximal copy number detected by the HMM is 7. In some cases, it is of interest to examine the exact copy number of an extremely amplified region, which may be implicated in tumor aggressiveness. Therefore, CLImAT performs a post-processing procedure for copy number annotation of highly amplified regions (copy number >7) using the following equation:

$$cn = \mathrm{round}\left(\frac{2\left(\dfrac{m_{rd} - o}{\lambda} - w_s\right)}{1 - w_s}\right) \tag{25}$$

Here $m_{rd}$ is the mean of the RD values and the function round() returns the nearest integer. To further improve the resolution of CLImAT, we also provide an option ('distSNPs_est') in the configuration file to estimate copy number for the regions between distant SNPs (>1kb) from the corresponding RD signals. Each region is divided into non-overlapping and equally sized windows (1kb), and the calculated RD signal is smoothed by a local median filter. The copy number of each window is then calculated from the RD signal using formula (25).
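The copy number annotation of a highly amplified region can be illustrated with a short Python sketch. It assumes a normal copy number of 2 and that the expected RD of a region is the copy neutral read count times half the average copy number of the tumor-normal mixture, plus the offset o; under those assumptions, solving for the tumor copy number gives the expression below. The function and argument names are hypothetical.

```python
def annotate_copy_number(m_rd, lam, o, w_s):
    """Estimate the integer tumor copy number of a region from its mean RD.

    m_rd: mean RD of the region
    lam:  copy neutral read count
    o:    RD offset parameter
    w_s:  tumor impurity level (fraction of normal cells), must be below 1
    """
    return round(2.0 * ((m_rd - o) / lam - w_s) / (1.0 - w_s))
```

For example, with a copy neutral read count of 30, no offset and an impurity level of 0.2, a region with mean RD 126 is annotated with copy number 10.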
1.4. Reliability score for aberration detection

It is important to provide a measure that lets users evaluate the reliability of CLImAT results. For a genomic region with tumor heterogeneity, the BAF/RD signals do not fit the HMM used in CLImAT, and therefore the posterior probabilities of the heterogeneous region should be much lower than those of homogeneous regions. This motivates us to use the posterior probabilities of the observed BAF/RD signals to measure reliability. It is also preferable to divide the posterior probabilities of the observed BAF/RD signals by the probabilities of the expected BAF/RD signals, so that reliability scores are comparable among different hidden states in the HMM. Accordingly, we define a reliability score for each aberrant region in the results as follows:

$$\mathrm{Score}_i = \operatorname{mean}_j\left(\frac{p(b_{ij} \mid w_s, N_{ij}, c)\; p(d_{ij} \mid w_s, o, \lambda, p_c, c)}{p(\tilde{b}_{ij} \mid w_s, N_{ij}, c)\; p(\tilde{d}_{ij} \mid w_s, o, \lambda, p_c, c)}\right) \tag{26}$$

Here $b_{ij}$ and $N_{ij}$ are the B allelic and total read counts of the jth heterozygous SNP position in region i, $d_{ij}$ is the corresponding RD value, $\tilde{b}_{ij}$ is the expected B allelic read count and $\tilde{d}_{ij}$ is the expected RD value in state c. Furthermore, for illustration, the scores of all regions detected along the cancer genome are scaled to the range 0 to 100.
1.5. Performance evaluation

We adopt the performance evaluation procedure proposed in APOLLOH (Ha, et al., 2012), in which all calls at the informative (heterozygous) positions are used as the gold standard to compare the abilities of different computational methods in detecting genomic aberrations. Accordingly, for simulated tumor samples, the CNA/LOH calls of all heterozygous positions pre-determined in the tumor-normal admixture experiment (see Figure S1 and Section 1.6 for details) are treated as the ground truth. For the TNBC samples assayed by the Affymetrix SNP6.0 array, we filter out uninformative homozygous positions from the tumor SNP array data using the ASCAT software (version 2.1) (Van Loo, et al., 2010), and use the CNA/LOH calls of all heterozygous positions recognized by ASCAT as the ground truth. We evaluate performance in the standard way, by separately comparing the results of each computational method investigated in this study to the ground truth in terms of sensitivity and specificity (see below). For evaluation of LOH detection, all LOH positions are treated as positives and the non-LOH positions are treated as negatives. For evaluation of CNA detection, positions with copy number alteration (copy number ≠ 2) are treated as positives, and copy neutral (copy number = 2) positions are treated as negatives. For each tumor sample, true positives (TP) are defined as positive positions that are correctly detected as positives by a computational method, true negatives (TN) are defined as negative positions that are correctly detected as negatives, false positives (FP) are defined as negative positions that are wrongly detected as positives, and false negatives (FN) are defined as positive positions that are wrongly detected as negatives. Two performance measurements, sensitivity and specificity, are employed to compare the performance of CNA and LOH detection for different methods. They are defined as follows:

$$\mathrm{sensitivity} = \frac{TP}{TP + FN} \tag{27}$$

$$\mathrm{specificity} = \frac{TN}{TN + FP} \tag{28}$$
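The two performance measurements can be computed from per-position calls as in the following Python sketch. The function name and the boolean encoding (True for a positive position, such as LOH or copy number different from 2) are illustrative assumptions.

```python
def sensitivity_specificity(truth, predicted):
    """Return (sensitivity, specificity) for paired boolean calls per position."""
    tp = sum(1 for t, p in zip(truth, predicted) if t and p)
    tn = sum(1 for t, p in zip(truth, predicted) if not t and not p)
    fp = sum(1 for t, p in zip(truth, predicted) if not t and p)
    fn = sum(1 for t, p in zip(truth, predicted) if t and not p)
    return tp / (tp + fn), tn / (tn + fp)
```

The sketch assumes that the ground truth contains at least one positive and one negative position; otherwise one of the ratios is undefined.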
The results generated by all investigated methods for performance evaluation and the detailed information on all evaluated genomic positions in TNBC samples are provided at http://bioinformatics.ustc.edu.cn/CLImAT/download.html.
1.6. Tumor-normal admixture experiment

The tumor-normal admixture simulation experiment is performed on chromosome 20 of the human reference (NCBI build 36, hg18) by sampling reads from a control genome and a test genome under different predefined parameters (Figure S1). The test genome is constructed according to a predefined HMM state sequence with different copy number and BAF values. First, the control genome is divided into non-overlapping and equally sized segments (up to 20 segments are generated), and each segment is randomly assigned a hidden state of the HMM used in CLImAT. Similar to the method used in a previous study (Duan, et al., 2013), for a segment with copy number n and BAF value b, we generate n copies of the segment, with SNP information added to each sequence according to the BAF value (Figure S1). We then join n-1 sequences to construct a new segment and replace the original segment on one chain with the new one. After the test genome has been generated, the tumor-normal admixture experiment is performed under different tumor impurity levels. Let the lengths of the control genome and test genome be $L_c$ and $L_t$ respectively, the read length be $l$, and the level of tumor impurity be $w_s$. For a depth of coverage $c$, the number of reads that need to be sampled is $N = c\,L_c / l$. The number of reads sampled from the control genome can be empirically calculated as follows:

$$N_c = \frac{w_s L_c}{w_s L_c + (1 - w_s) L_t}\, N \tag{29}$$

The number of reads sampled from the test genome is

$$N_t = \frac{(1 - w_s) L_t}{w_s L_c + (1 - w_s) L_t}\, N \tag{30}$$
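The allocation of reads between the control and test genomes can be sketched in Python as follows. The function and argument names are hypothetical, and the results are left as floating point values; how they are rounded to whole reads is not specified in the text.

```python
def allocate_reads(coverage, L_c, L_t, read_len, w_s):
    """Return (N, N_c, N_t): total reads and reads drawn from control and test genomes.

    coverage: target depth of coverage
    L_c, L_t: lengths of the control and test genomes
    read_len: read length
    w_s:      tumor impurity level (fraction of normal cells)
    """
    N = coverage * L_c / read_len
    denom = w_s * L_c + (1.0 - w_s) * L_t
    N_c = w_s * L_c / denom * N
    N_t = (1.0 - w_s) * L_t / denom * N
    return N, N_c, N_t
```

Because the test genome is longer than the control genome wherever copies were inserted, weighting by genome length gives it a larger share of reads than its cell fraction alone would.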
GC-content is one of the main factors that influence the depth of coverage. To simplify the form of the GC-content effect on read count, we use a simple probability model to describe the sampling process: $p(Y_i = 1 \mid GC_i)$ is the probability that a read is sampled from a window with GC-percentage $GC_i \in \{0, 1, 2, \ldots, 100\}$. The probability distribution $p(Y = 1 \mid GC)$ is learned from real TNBC sample 1 using 2-copy regions. For a GC-percentage j, we take the median read count ($RC_j$) of all windows that have the same GC-percentage j, and $p(Y_i = 1 \mid GC_i)$ is calculated as follows:

$$p(Y_i = 1 \mid GC_i) = p(Y_i = 1 \mid GC_i = j) = \frac{RC_j}{\sum_{k=0}^{100} RC_k} \tag{31}$$

The probability $p(Y_i = 1 \mid GC_i)$ is further normalized along the genome:

$$np(Y_i = 1 \mid GC_i) = \frac{p(Y_i = 1 \mid GC_i)}{\sum_{k=1}^{W} p(Y_k = 1 \mid GC_k)}, \quad i = 1, 2, \ldots, W \tag{32}$$

Here W is the number of windows. In the sampling process, the control and test genomes are first divided into non-overlapping and equally sized windows (1000bp), and the GC-percentage is calculated for each window. For a window i, the number of sampled reads that start within the window is obtained as follows:

$$RN_i = np(Y_i = 1 \mid GC_i) \times N_g \tag{33}$$
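The GC-dependent read sampling described above can be sketched in Python. The function name, the list-based lookup of median read counts by GC-percentage, and the rounding of each window's read number to an integer are assumptions made for illustration.

```python
def reads_per_window(window_gc, median_rc_by_gc, n_reads):
    """Number of reads starting in each window, given its GC-percentage.

    window_gc:       GC-percentage (integer 0..100) of each window
    median_rc_by_gc: list of 101 median read counts, indexed by GC-percentage
    n_reads:         total number of reads sampled from the genome (N_g)
    """
    total_rc = sum(median_rc_by_gc)
    # Probability that a read is sampled from a window with this GC-percentage.
    p = [median_rc_by_gc[gc] / total_rc for gc in window_gc]
    # Normalize along the genome so that the window probabilities sum to one.
    norm = sum(p)
    # Assumption: round to whole reads; the rounded counts may not sum exactly to n_reads.
    return [round(p_i / norm * n_reads) for p_i in p]
```

Note that the first normalization constant cancels in the second step, so only the relative median read counts of the GC-percentages present in the genome affect the result.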
Here $N_g$ is the total number of reads sampled from the whole genome. We randomly choose $RN_i$ positions from window i as the start positions of the reads and then sample the reads from the genome. All reads sampled from both the control and test genomes are aligned to chromosome 20 of the human reference (hg18) using Bowtie (Langmead, et al., 2009), and BAM files and pileups are generated using SAMtools (Li, et al., 2009). To evaluate the reliability score for aberration detection described in Section 1.4, we further generate heterogeneous tumor data containing cancer subclones, using a simulated diploid tumor sample with a tumor impurity level of 0.2. In region I (Figure S10), we assume there are two subclones with proportions of 0.24 and 0.56, respectively. Specifically, subclone 1 has a copy number of 3 with a minor copy number of 1, and subclone 2 has a copy number of 4 with a minor copy number of 1. We adjust the means of the RD and BAF values in region I using extensions of Equations (3) and (4) in the manuscript:

$$y_c = n_s w_s + \sum_{i=1}^{2} n_{c,i}\, w_{c,i} \tag{34}$$

$$z_c = n_s \mu_s w_s + \sum_{i=1}^{2} n_{c,i}\, \mu_{c,i}\, w_{c,i} \tag{35}$$

where $n_{c,i}$, $\mu_{c,i}$ and $w_{c,i}$ are the copy number, BAF and proportion of the ith cancer subclone, respectively.
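The mixture of normal cells and two cancer subclones in region I can be worked through with a small Python sketch. The function name is hypothetical, and the defaults assume a normal copy number of 2 with a normal BAF of 0.5. The example further assumes that the B allele is the minor allele in both subclones, so the subclone BAF values are 1/3 and 1/4.

```python
def mixed_signals(w_s, subclones, n_s=2, mu_s=0.5):
    """Return (y, z, z / y) for a mixture of normal cells and cancer subclones.

    w_s:       proportion of normal cells
    subclones: list of (copy_number, baf, proportion) tuples
    """
    y = n_s * w_s + sum(n * w for n, _, w in subclones)
    z = n_s * mu_s * w_s + sum(n * mu * w for n, mu, w in subclones)
    return y, z, z / y


# Region I: 20% normal cells, subclone 1 (copy number 3, proportion 0.24)
# and subclone 2 (copy number 4, proportion 0.56).
y, z, baf = mixed_signals(0.2, [(3, 1.0 / 3.0, 0.24), (4, 0.25, 0.56)])
```

Under these assumptions the average copy number is about 3.36 and the expected BAF is about 0.30, values that do not match any single hidden state exactly.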
1.7. Details of the investigated methods

1.7.1. ABSOLUTE

ABSOLUTE (version 1.0.6) (Carter, et al., 2012) takes a user-generated segmentation file as input and outputs inferred tumor purity and ploidy. We test ABSOLUTE on simulated samples. For each pair of tumor-normal samples, the segmentation file is generated using the method proposed in THetA (Oesper, et al., 2013). ABSOLUTE returns three kinds of solutions: (1) solutions based on somatic copy number aberrations (SCNAs); (2) solutions based on recurrent karyotypes; and (3) solutions based on a combination of SCNAs and karyotypes. We use the default solution returned by ABSOLUTE as the final solution. When running ABSOLUTE on simulated samples, we set the maximal possible ploidy to the maximal copy number in the simulated data rather than the default value of 10. All other parameters are set to their default values as described in the ABSOLUTE documentation.

1.7.2. SNVMix

SNVMix (version 2-0.11) (Goya, et al., 2010) is a tool to predict single nucleotide variants from next-generation sequencing of tumors. When running SNVMix, we use the model parameter file provided at the SNVMix website: http://compbio.bccrc.ca/software/snvmix/.

1.7.3. FREEC

FREEC (version 6.7) (Boeva, et al., 2012; Boeva, et al., 2011) is one of the widely adopted tools to call CNA and LOH from next-generation sequencing data. FREEC can deal with either unpaired tumor samples or paired tumor-normal samples. In this study, we run FREEC on unpaired tumor samples with the following parameters:

| Datasets | User-defined FREEC parameters |
|---|---|
| Simulated diploid samples | ploidy = 2, numberOfProcesses = 5, sex = XX, contaminationAdjustment = TRUE, inputFormat = pileup, mateOrientation = 0, SNPfile = snp130_hg18_1based.txt |
| Simulated triploid samples | ploidy = 3, numberOfProcesses = 5, sex = XX, contaminationAdjustment = TRUE, inputFormat = pileup, mateOrientation = 0, SNPfile = snp130_hg18_1based.txt |
| Simulated tetraploid samples | ploidy = 4, numberOfProcesses = 5, sex = XX, contaminationAdjustment = TRUE, inputFormat = pileup, mateOrientation = 0, SNPfile = snp130_hg18_1based.txt |
| Real TNBC samples, TNBC sample 1 | ploidy = 2, numberOfProcesses = 5, sex = XX, contaminationAdjustment = TRUE, inputFormat = pileup, mateOrientation = FF, SNPfile = snp130_hg18_1based.txt |
| Real TNBC samples, TNBC sample 2 | ploidy = 3, numberOfProcesses = 5, sex = XX, contaminationAdjustment = TRUE, inputFormat = pileup, mateOrientation = FF, SNPfile = snp130_hg18_1based.txt |
| Real TNBC samples, TNBC sample 3 | ploidy = 4, numberOfProcesses = 5, sex = XX, contaminationAdjustment = TRUE, inputFormat = pileup, mateOrientation = FF, SNPfile = snp130_hg18_1based.txt |
1.7.4. Patchwork

Patchwork (version 2.4) (Mayrhofer, et al., 2013) is designed to perform allele-specific copy number analysis of whole-genome sequenced tumor tissue. It needs a matched normal sample or a reference to normalize tumor sequencing data. Because the EGA database (EGAS00001000132) lacks matched normal samples, as well as unrelated normal samples that adequately meet the requirements for generating a common reference file (hg18) for the TNBC samples, the performance of Patchwork was not evaluated on the TNBC samples in this study. When running Patchwork on simulated data, we use the default parameters as described in the Patchwork documentation. The intermediate arguments, manually determined from the plot of chromosome 20, are given as follows:

| Parameters | diploidy normal010 | diploidy normal040 | diploidy normal070 | triploidy normal010 | triploidy normal040 | triploidy normal070 | tetraploidy normal010 | tetraploidy normal040 | tetraploidy normal070 |
|---|---|---|---|---|---|---|---|---|---|
| cn2 | 0.78 | 0.85 | 0.93 | 0.65 | 0.75 | 0.85 | 0.50 | 0.60 | 0.76 |
| delta | 0.35 | 0.25 | 0.14 | 0.27 | 0.23 | 0.12 | 0.20 | 0.15 | 0.09 |
| het | 0.25 | 0.25 | 0.25 | 0.27 | 0.27 | 0.26 | 0.30 | 0.28 | 0.24 |
| hom | 0.95 | 0.75 | 0.40 | 0.95 | 0.73 | 0.45 | 0.95 | 0.72 | 0.45 |
1.7.5. CLImAT

CLImAT takes about 70 minutes to process 250G of tumor WGS data on a standard desktop PC with a 3.4GHz CPU and 4G RAM. The DFExtract software is developed in-house and currently works on Unix-based systems. Matlab must be installed on the user's computer to run the CLImAT software.
References

Boeva, V., et al. (2012) Control-FREEC: a tool for assessing copy number and allelic content using next-generation sequencing data, Bioinformatics, 28, 423-425.
Boeva, V., et al. (2011) Control-free calling of copy number alterations in deep-sequencing data using GC-content normalization, Bioinformatics, 27, 268-269.
Bolstad, B.M., et al. (2003) A comparison of normalization methods for high density oligonucleotide array data based on variance and bias, Bioinformatics, 19, 185-193.
Carter, S.L., et al. (2012) Absolute quantification of somatic DNA alterations in human cancer, Nature biotechnology, 30, 413-421.
Duan, J., et al. (2013) Comparative studies of copy number variation detection methods for next-generation sequencing technologies, PloS one, 8, e59128.
Goya, R., et al. (2010) SNVMix: predicting single nucleotide variants from next-generation sequencing of tumors, Bioinformatics, 26, 730-736.
Ha, G., et al. (2012) Integrative analysis of genome-wide loss of heterozygosity and monoallelic expression at nucleotide resolution reveals disrupted pathways in triple-negative breast cancer, Genome research, 22, 1995-2007.
Kim, S., et al. (2013) Virmid: accurate detection of somatic mutations with sample impurity inference, Genome biology, 14, R90.
Langmead, B., et al. (2009) Ultrafast and memory-efficient alignment of short DNA sequences to the human genome, Genome Biol, 10, R25.
Li, H. and Durbin, R. (2009) Fast and accurate short read alignment with Burrows–Wheeler transform, Bioinformatics, 25, 1754-1760.
Li, H., et al. (2009) The sequence alignment/map format and SAMtools, Bioinformatics, 25, 2078-2079.
Mayrhofer, M., DiLorenzo, S. and Isaksson, A. (2013) Patchwork: allele-specific copy number analysis of whole genome sequenced tumor tissue, Genome biology, 14, R24.
Oesper, L., Mahmoody, A. and Raphael, B.J. (2013) THetA: Inferring intra-tumor heterogeneity from high-throughput DNA sequencing data, Genome biology, 14, R80.
Rabiner, L.R. (1989) A tutorial on hidden Markov models and selected applications in speech recognition, Proceedings of the IEEE, 77, 257-286.
Staaf, J., et al. (2008) Normalization of Illumina Infinium whole-genome SNP data improves copy number estimates and allelic intensity ratios, BMC bioinformatics, 9, 409.
Van Loo, P., et al. (2010) Allele-specific copy number analysis of tumors, Proceedings of the National Academy of Sciences, 107, 16910-16915.
2. Supplementary Figures
Fig. S1. Tumor-normal mixture experiment. A. Illustration of the read sampling process. The test genome is constructed from the control genome (reference genome) by randomly inserting aberrations into some specific regions. Reads are sampled from both the control and test genomes under a predefined tumor impurity level, and the sampled reads are aligned to the reference using Bowtie. B. Process of constructing the test genome and inserting SNP information. For the red marked region with copy number 4 and BAF value 0.75/0.25, 4 copies of the region are generated with 4 SNPs inserted into each sequence according to the BAF value (each SNP is displayed as: reference allele/alternative allele, BAF value).
Fig. S2. The pipeline of CLImAT. Read depth and B allele frequency are derived from tumor WGS data using SAMtools. Signal preprocessing including GC-content and mappability correction of RD and quantile normalization of BAF are adopted to improve signal quality. RD and BAF are modeled using an integrated HMM to identify copy number alteration and LOH. In addition, CLImAT can automatically estimate tumor impurity and ploidy and output tumor genotype.
Fig. S3. Plots of original and corrected RD signals with respect to GC-Content and mappability score. A. Plots of original RD signals. B. Plots of corrected RD signals. Relationship between RD and GC-Content/mappability is fitted by a cubic polynomial curve for each P-copy region, and the results of the least-square fit for 1-copy, 2-copy and 3-copy are shown in black.
Fig. S4. Comparison of three RD correction procedures. A. GC-content correction followed by mappability correction. B. Mappability correction followed by GC-content correction. C. Simultaneous GC-content and mappability correction.
Fig. S5. Quantile normalization of BAF signals. A. Plots of the original BAF signals on chromosome 8 of TNBC samples 1 and 2. B. Plots of the BAF signals after quantile normalization.
Fig. S6. Results of CLImAT for simulated tumor samples at 60X coverage. A. Results for diploid tumor samples. B. Results for triploid tumor samples. BAF is presented by five different aberration states: homozygous deletion (HOMD), hemizygous deletion (HEMD), heterozygous (HET), copy neutral LOH (NLOH) and amplified LOH (ALOH). LRR/RD is presented by homozygous deletion (HOMD), hemizygous deletion (HEMD), neutral (NEUT) and amplification (AMP). The left labels show the level of tumor impurity; for example, “T90N10” denotes a tumor impurity level of 0.1, with 90% tumor and 10% normal cells.
Fig. S7. Estimated tumor impurity of simulated samples at 30X coverage. 2p: diploid samples, 3p: triploid samples, 4p: tetraploid samples.
Fig. S8. CNA detection performance of FREEC and CLImAT. A. Results for diploid genome. B. Results for triploid genome. C. Results for tetraploid genome.
Fig. S9. Performance of CLImAT for simulated data at 10X coverage. A. Results for LOH detections. B. Results for CNA detections.
Fig. S10. Reliability scores for simulated tumor data containing two cancer subclones. Subclone 1 has copy number of 3 and subclone 2 has copy number of 4 in region I. For other regions both subclones share the same copy number. The reliability score of region I is significantly lower than those of other regions, for the reason that the BAF/RD signals in region I do not fit to the HMM.
3. Supplementary Tables

Table S1. Definition of hidden states in CLImAT

| State | Copy number | (Tumor genotype, normal genotype) | Zygosity status |
|---|---|---|---|
| 1 | 0 | (N/A, AA), (N/A, BB), (N/A, AB) | Del |
| 2 | 1 | (A,AA), (B,BB), (A,AB), (B,AB) | Del, LOH |
| 3 | 2 | (AA,AA), (BB,BB), (AB,AB) | Het |
| 4 | 2 | (AA,AA), (AA,AB), (BB,BB), (BB,AB) | LOH |
| 5 | 3 | (AAA,AA), (BBB,BB), (AAB,AB), (ABB,AB) | Het |
| 6 | 3 | (AAA,AA), (AAA,AB), (BBB,BB), (BBB,AB) | LOH |
| 7 | 4 | (AAAA,AA), (BBBB,BB), (AAAB,AB), (ABBB,AB) | Het |
| 8 | 4 | (AAAA,AA), (BBBB,BB), (AABB,AB) | Het |
| 9 | 4 | (AAAA,AA), (BBBB,BB), (AAAA,AB), (BBBB,AB) | LOH |
| 10 | 5 | (AAAAA,AA), (BBBBB,BB), (AAAAB,AB), (ABBBB,AB) | Het |
| 11 | 5 | (AAAAA,AA), (BBBBB,BB), (AAABB,AB), (AABBB,AB) | Het |
| 12 | 5 | (AAAAA,AA), (BBBBB,BB), (AAAAA,AB), (BBBBB,AB) | LOH |
| 13 | 6 | (AAAAAA,AA), (BBBBBB,BB), (AAAAAB,AB), (ABBBBB,AB) | Het |
| 14 | 6 | (AAAAAA,AA), (BBBBBB,BB), (AAAABB,AB), (AABBBB,AB) | Het |
| 15 | 6 | (AAAAAA,AA), (BBBBBB,BB), (AAABBB,AB) | Het |
| 16 | 6 | (AAAAAA,AA), (BBBBBB,BB), (AAAAAA,AB), (BBBBBB,AB) | LOH |
| 17 | 7 | (AAAAAAA,AA), (BBBBBBB,BB), (AAAAAAB,AB), (ABBBBBB,AB) | Het |
| 18 | 7 | (AAAAAAA,AA), (BBBBBBB,BB), (AAAAABB,AB), (AABBBBB,AB) | Het |
| 19 | 7 | (AAAAAAA,AA), (BBBBBBB,BB), (AAAABBB,AB), (AAABBBB,AB) | Het |
| 20 | 7 | (AAAAAAA,AA), (BBBBBBB,BB), (AAAAAAA,AB), (BBBBBBB,AB) | LOH |
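Table S2. Performance comparison of Patchwork and CLImAT on simulated data at 30X coverage. Column headers give the ploidy of the simulated sample and the tumor impurity level.

| Aberration | Performance measurement | Methods | diploidy 0.1 | diploidy 0.4 | diploidy 0.7 | triploidy 0.1 | triploidy 0.4 | triploidy 0.7 | tetraploidy 0.1 | tetraploidy 0.4 | tetraploidy 0.7 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| LOH | Sensitivity | Patchwork | 0.98 | 0.98 | 0.98 | 0.98 | 0.82 | 0.82 | 0.85 | 0.86 | 0.86 |
| LOH | Sensitivity | CLImAT | 1.00 | 1.00 | 1.00 | 1.00 | 0.99 | 1.00 | 0.99 | 1.00 | 1.00 |
| LOH | Specificity | Patchwork | 0.37 | 0.52 | 0.29 | 0.92 | 0.96 | 0.96 | 0.98 | 0.98 | 0.98 |
| LOH | Specificity | CLImAT | 1.00 | 1.00 | 0.99 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| CNA | Sensitivity | Patchwork | 0.98 | 0.98 | 0.98 | 0.98 | 0.98 | 0.99 | 0.98 | 0.97 | 0.98 |
| CNA | Sensitivity | CLImAT | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| CNA | Specificity | Patchwork | 0.96 | 0.97 | 0.97 | 0.84 | 0.87 | 0.88 | 0.84 | 0.88 | 0.89 |
| CNA | Specificity | CLImAT | 1.00 | 1.00 | 1.00 | 1.00 | 0.99 | 0.99 | 1.00 | 1.00 | 1.00 |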
Table S3. CNA detection performance for TNBC samples.

| Samples | FREEC Sensitivity | FREEC Specificity | CLImAT Sensitivity | CLImAT Specificity |
|---|---|---|---|---|
| Sample 1 | 0.99 | 0.98 | 0.99 | 0.98 |
| Sample 2 | 0.87 | 0.10 | 0.97 | 0.97 |
| Sample 3 | 0.99 | 0.38 | 0.99 | 0.99 |