Supplementary Information for

The transcriptional landscape of B-cell precursor acute lymphoblastic leukemia based on an international study of 1,223 cases

Jian-Feng Li (a,1), Yu-Ting Dai (a,1), Henrik Lilljebjörn (b,1), Shu-Hong Shen (c), Bo-Wen Cui (a), Ling Bai (a), Yuan-Fang Liu (a), Mao-Xiang Qian (d), Yasuo Kubota (e), Hitoshi Kiyoi (f), Itaru Matsumura (g), Yasushi Miyazaki (h), Linda Olsson (b), Ah Moy Tan (i), Hany Ariffin (j), Jing Chen (c), Junko Takita (k), Takahiko Yasuda (l), Hiroyuki Mano (m), Bertil Johansson (b,n), Jun J. Yang (d,o), Allen Eng-Juh Yeoh (p), Fumihiko Hayakawa (q), Zhu Chen (a,r,s,2), Ching-Hon Pui (o,2), Thoas Fioretos (b,n,2), Sai-Juan Chen (a,r,s,2), Jin-Yan Huang (a,s,2)

(a) State Key Laboratory of Medical Genomics, Shanghai Institute of Hematology, National Research Center for Translational Medicine, Rui-Jin Hospital, Shanghai Jiao Tong University School of Medicine and School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Shanghai 200025, China
(b) Department of Laboratory Medicine, Division of Clinical Genetics, Lund University, Lund 22184, Sweden
(c) Key Laboratory of Pediatric Hematology & Oncology, Ministry of Health, Department of Hematology and Oncology, Shanghai Children's Medical Center, Shanghai Jiao Tong University School of Medicine, Shanghai 200127, China
(d) Department of Pharmaceutical Sciences, St. Jude Children's Research Hospital, Memphis, TN 38105, USA
(e) Department of Pediatrics, Graduate School of Medicine, The University of Tokyo, Tokyo 1138654, Japan
(f) Department of Hematology and Oncology, Nagoya University Graduate School of Medicine, Nagoya 4668550, Japan
(g) Division of Hematology and Rheumatology, Kinki University Faculty of Medicine, Osaka 5778502, Japan
(h) Department of Hematology, Atomic Bomb Disease Institute, Nagasaki University, Nagasaki 8528521, Japan
(i) Department of Paediatrics, KK Women's & Children's Hospital, 229899, Singapore
(j) Paediatric Haematology-Oncology Unit, University of Malaya Medical Centre, Kuala Lumpur 59100, Malaysia
(k) Department of Pediatrics, Graduate School of Medicine, Kyoto University, Kyoto 6068501, Japan
(l) Clinical Research Center, Nagoya Medical Center, National Hospital Organization, Nagoya 4600001, Japan
(m) National Cancer Center Research Institute, Tokyo 1040045, Japan
(n) Department of Clinical Genetics and Pathology, Division of Laboratory Medicine, Lund 22185, Sweden
(o) Department of Oncology, St. Jude Children's Research Hospital, Memphis, TN 38105, USA
(p) Centre for Translational Research in Acute Leukaemia, Department of Paediatrics, Yong Loo Lin School of Medicine, and Cancer Science Institute of Singapore, National University of Singapore, 119228, Singapore
(q) Department of Pathophysiological Laboratory Sciences, Nagoya University Graduate School of Medicine, Nagoya 4618673, Japan
(r) Key Laboratory of Systems Biomedicine, Ministry of Education, Shanghai Center for Systems Biomedicine, Shanghai Jiao Tong University, Shanghai 200240, China
(s) Pôle de Recherches Sino-Français en Science du Vivant et Génomique, Laboratory of Molecular Pathology, Rui-Jin Hospital, Shanghai 200025, China

(1) These authors contributed equally to this work.
(2) To whom correspondence should be addressed. E-mail:
[email protected] (Z.C.),
[email protected] (C.-H.P.),
[email protected] (T.F.),
[email protected] (S.-J.C.), and
[email protected] (J.-Y.H).
This PDF file includes:
Supplementary text
Figs. S1 to S14
References for SI reference citations

Other supplementary materials for this manuscript include the following:
Datasets S1 to S6
Supplementary Information Text

SI Materials and Methods

RNA-seq alignment and pre-processing. HISAT2 (v2.0.5) (1) and STAR (v2.5.2b) (2) were used to align raw RNA-seq reads to the hg38 (HISAT2) and hg19 (STAR) human reference genomes, which were downloaded from the UCSC Genome Browser (http://hgdownload.soe.ucsc.edu/) (3). STAR mapping and pre-processing steps were carried out mainly according to the best-practice pipeline recommended on the Genome Analysis Toolkit (GATK, v3.7.0) (4) forum.

Mutation calling. In this RNA-driven study, a previously reported method that combines RNA-seq with the available subset of whole exome sequencing (WES) data (n=172) (Dataset S1) was applied to detect driver mutations in the large-scale BCP ALL RNA-seq samples (5). The final genes reported in this work mainly depended on two conditions: 1) recurrence of the gene mutation was greater than or equal to 12; 2) the gene was reported in public BCP ALL genomic datasets. Raw sequence variants were called from all RNA-seq data according to the analysis pipeline recommended on the GATK forum, and the results of GATK HaplotypeCaller were taken as the high-confidence sites (4, 6). Several additional variant callers, such as GATK UnifiedGenotyper, LoFreq (7) and VarScan2 (8), were used to prevent over-strict filtration by the GATK best-practice method. The in-house R package BioInstaller (http://bioinfo.rjh.com.cn/labs/jhuang/tools/BioInstaller/) was used to download the annotation databases required for RNA variant calling. RNA variant calls were annotated by ANNOVAR (9) and the in-house R package annovarR (http://bioinfo.rjh.com.cn/labs/jhuang/tools/annovarR/). Screening of the candidate driver gene mutations was carried out according to published procedures and the original articles (6, 10, 11).

Screening of the candidate driver gene mutations. Screening of gene mutations from the bulk of raw variant sites followed these criteria:
1) >10x coverage at the variant site.
2) Variant allele frequency ≥5% and at least three individual mutant reads.
3) Filter variants observed only on the positive strand or only on the negative strand.
4) Filter reads with base quality < 13.
5) Filter sites present in the 1000 Genomes Project (12) and the Genome Aggregation Database (gnomAD) (13) with ≥1% frequency.
6) Filter sequence variations with frequency ≤2 in normal control samples [blood in complete remission (CR), saliva, and skin tissues] from 265 ALL and lymphoma cases in our previous publications.
7) Not found in the DARNED (14) and RADAR (15) RNA-editing databases.
8) Samtools (16) and IGV (17) were applied to all reported variant sites.
9) Systematic bias sites were also evaluated by the total recurrence counts and the types of variant-calling methods.
10) Filter dbSNP (v147) (18) sites, except those recurrently found in the COSMIC (v81) (19) database and involved in leukemia.
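As an illustration only, the read-level screening criteria 1, 2, 3 and 5 could be expressed as a simple predicate. This is a minimal sketch, not the authors' pipeline: the `Variant` record and its field names are hypothetical, and criterion 5 is read here as "the highest frequency in either population database is at least 1%", which is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    """Hypothetical per-site summary; field names are illustrative only."""
    depth: int            # total reads covering the site
    alt_reads: int        # reads supporting the variant allele
    alt_forward: int      # variant reads on the positive strand
    alt_reverse: int      # variant reads on the negative strand
    population_af: float  # assumed: highest allele frequency in 1000 Genomes / gnomAD


def passes_basic_filters(v: Variant) -> bool:
    """Apply screening criteria 1, 2, 3 and 5 to one variant site."""
    if v.depth <= 10:                             # 1) require >10x coverage
        return False
    if v.alt_reads < 3:                           # 2) at least three mutant reads
        return False
    if v.alt_reads / v.depth < 0.05:              # 2) variant allele frequency >= 5%
        return False
    if v.alt_forward == 0 or v.alt_reverse == 0:  # 3) drop single-strand variants
        return False
    if v.population_af >= 0.01:                   # 5) drop sites with >=1% population frequency
        return False
    return True
```

The remaining criteria (normal-control recurrence, RNA-editing databases, manual review in IGV, dbSNP/COSMIC cross-checks) depend on external resources and are not modeled here.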
Fusion gene calling. FusionCatcher (v0.99.3d) (20) and deFuse (v0.6.2) (21) were applied to detect fusion genes from raw RNA-seq data. For calling a fusion gene, the following quality controls and filtrations were performed: 1) the fusion points required support by at least two separate covered reads and three spanning reads, 2) the fusion gene pairs reported in the healthy population were filtered out, and 3) the fusion genes were filtered by a blacklist to reduce the false positive rate. Moreover, fusion gene information was collected from the original articles (5, 11, 22-25) and combined with the results of fusion gene detection in raw RNA-seq data.

Gene expression analysis. We used fragments per kilobase of transcript per million mapped fragments (FPKM) to evaluate expression levels of individual genes (10). Count tables were generated by the htseq-count subprogram of HTSeq (26) from the BAM files generated by HISAT2 (1), using the GENCODE annotation database. FPKM values were then derived from the count tables after normalizing for the length of transcripts or genes. Differentially expressed genes were obtained using DESeq2 (v1.18.1) (27). The genes with the highest expression variance across all 1,223 patients were selected for the first step of unsupervised clustering, and eight clusters were identified (SI Appendix, Fig. S4). Genetic markers were clear in the first seven clusters but relatively complex in the last cluster. Therefore, we performed two-step unsupervised hierarchical clustering. In the second step, cases in the last cluster that had the same genetic features as the first seven clusters, such as ZNF384 fusions, BCR-ABL1, and hyperdiploidy, were moved back into the first seven clusters (SI Appendix, Fig. S4). Unsupervised clustering identified seven subgroups (G1-G7) using the top-variance genes of all cases in the first seven clusters (Fig. 1).
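The selection of top-variance genes that feeds the unsupervised clustering can be sketched as follows. This is a minimal illustration under stated assumptions: the function name and the dictionary layout are hypothetical, and the text does not specify whether expression values were transformed before the variance was computed.

```python
from statistics import pvariance


def top_variance_genes(expression, n_top):
    """Return the n_top gene names with the highest variance across samples.

    expression: dict mapping gene name -> list of per-sample expression values
    (for example FPKM). Population variance is used; sample variance would
    give the same ranking when every gene has the same number of samples.
    """
    ranked = sorted(expression, key=lambda gene: pvariance(expression[gene]),
                    reverse=True)
    return ranked[:n_top]


# Toy example with made-up values: three genes measured in four samples.
toy = {
    "GENE_FLAT": [5.0, 5.0, 5.0, 5.0],
    "GENE_MID": [1.0, 2.0, 3.0, 4.0],
    "GENE_HIGH": [0.0, 10.0, 0.0, 10.0],
}
selected = top_variance_genes(toy, 2)
```

The resulting gene subset would then be passed to a hierarchical clustering routine; that step is not shown here.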
The top 699 genes displaying the highest variance among the remaining 235 samples (the last cluster) were subjected to a second unsupervised clustering step, which identified the clusters G8-G14. Different numbers of genes (genes with top expression variation, range 85% to 99%) were selected to evaluate the stability of clustering. More than 95% of BCP-ALLs were consistently located in the same subgroups. Gene Set Enrichment Analysis (GSEA) was performed using GSEA (v3.0, http://software.broadinstitute.org/gsea) with the MSigDB hallmark gene sets (H) and the MSigDB curated gene sets (C2) (28).
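For reference, the FPKM normalization mentioned under Gene expression analysis can be sketched with the standard definition, FPKM = count × 10^9 / (length in bp × total mapped fragments). This is a minimal sketch: the function and argument names are hypothetical, and the total of the count table is assumed to stand in for the number of mapped fragments, which may differ from the library-size definition used in the study.

```python
def fpkm(counts, lengths_bp):
    """Convert raw fragment counts to FPKM.

    counts: dict gene -> fragment count (e.g. one column of an htseq-count table)
    lengths_bp: dict gene -> gene or transcript length in base pairs
    """
    total = sum(counts.values())  # assumed library size: sum of counted fragments
    if total == 0:
        raise ValueError("no counted fragments in this sample")
    return {
        gene: count * 1e9 / (lengths_bp[gene] * total)
        for gene, count in counts.items()
    }


# Toy example with made-up numbers: B is three times longer and has three
# times the count of A, so both genes get the same FPKM.
values = fpkm({"A": 500, "B": 1500}, {"A": 1000, "B": 3000})
```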
SI Figures

Fig. S1. The overall survival rates of different cohorts (SIH, LUH, MaSpore and JALSG) based on favorable (ETV6-RUNX1) (A) and unfavorable (BCR-ABL1) (B) outcomes.
[Figure: panel A, "ETV6-RUNX1 in Pediatric B-ALL" (SIH, LUH, MaSpore, JALSG); panel B, "BCR-ABL1 in Pediatric B-ALL" (SIH, LUH, MaSpore). Y-axis: overall survival (%); each panel includes a number-at-risk (number censored) table. Annotated comparisons: SIH vs LUH, P = 0.997; SIH vs MaSpore, P = 0.970.]
Fig. S2. Outline of the study workflow
[Workflow diagram; boxes listed in reading order.]
Search "BCP ALL RNAseq" in EGA, JGA and dbGaP; request raw sequencing and clinical data.
International large cohorts: Cohort-1 (SIH, n=172); Cohort-2 (LUH, n=195); Cohort-3 (JALSG, n=75); Cohort-4 (MaSpore, n=204); Cohort-5 (TARGET/COG, n=416); Cohort-6 (TARGET/COG, n=223).
Available primary BCP-ALL samples with RNA-seq (n=1,285); quality analysis; 62 failed quality control.
Integrated analysis (n=1,223): available genomic data; public and in-house annotation databases; unified analysis pipeline.
Outputs: sequence variants, gene expression signatures, and fusion genes, plus survival data (OS and RFS), age, gender, and karyotype.
Prior knowledge; unsupervised clustering algorithm; adjustment for bias factors.
Gene-expression based BCP-ALL classification: known BCP-ALL subtypes and novel BCP-ALL subtypes.
RNA sequencing (RNA-seq) data of BCP ALL patients from five study groups (Lund University Hospital, TARGET/COG, SIH, and the MaSpore cohort) were collected from the study groups and/or downloaded from the genome-phenome archives dbGaP, EGA, JGA and CGAH. After quality control, 1,223 RNA-seq cases formed the basis for further analysis. Unsupervised hierarchical clustering methods were applied to the 1,223 eligible cases to determine the gene expression groups, using the genes showing the highest variance in their expression levels. Abbreviations: LUH, Lund University Hospital; SIH, Shanghai Institute of Hematology; JALSG, the Japan Adult Leukemia Study Group; MaSpore, the Singapore and Malaysia MaSpore cohort; TARGET/COG, Therapeutically Applicable Research to Generate Effective Treatments/COG cohort.
Fig. S3. Principal component analysis (PCA) of RNA-seq gene expression datasets before (A) and after (B) removing batch effects.
The first two principal components (PC1 and PC2), generated from the expression profiles of all genes after filtration, are plotted. Patients from different sources are shown in different colors.
Fig. S4. Unsupervised hierarchical clustering of global gene expression profile from 1,223 BCP ALL patients.
In the heatmap, columns indicate the 1,223 BCP ALL patients and rows represent gene expression levels or genetic features for each patient. Genes showing over- and under-expression are shown in red and blue, respectively. The boxes below the heatmap indicate the genotypes, fusion genes, and hotspot sequence mutations identified in the analysis. Subgroup colors are consistent with those in Fig. 1.
Fig. S5. Pathway-centric overview of candidate driver gene mutations in 1,223 BCP ALL patients
[Figure legend and row labels.]
Subgroups: G1 (MEF2D fusions); G2 (TCF3-PBX1); G3 (ETV6-RUNX1/-like); G4 (DUX4 fusions); G5 (ZNF384 fusions); G6 (BCR-ABL1/Ph-like); G7 (Hyperdiploidy); G8 (KMT2A fusions); G9 (PAX5 and CRLF2 fusions); G10 [PAX5 (p.P80R)]; G11 [IKZF1 (p.N159Y)]; G12 [ZEB2 (p.H1038R)/IGH-CEBPE]; G13 (TCF3/4-HLF); G14 (NUTM1 fusions).
Mutation types: missense, frameshift, nonsense, protein deletion, protein insertion, splice site.
Mutation counts: one hit mutation, two hit mutations, three or more hit mutations.
Signaling molecules (494): NRAS (176), KRAS (148), FLT3 (96), PTPN11 (64), JAK1 (20), JAK2 (19), NF1 (18), SH2B3 (18), IL7R (17), STAT5B (14).
Transcription factors (170): PAX5 (64), IKZF1 (26), ZEB2 (25), ETV6 (25), MGA (16), RUNX1 (15), MYC (12).
Epigenetic factors (351): KMT2D (67), CREBBP (48), WHSC1 (36), SETD2 (30), TRRAP (22), SETD1B (22), KMT2C (21), CTCF (20), EZH2 (19), KMT2A (18), ASXL1 (16), ARID1B (15), NCOR2 (14), CHD4 (14), KDM6A (14), ARID1A (13), CHD8 (13), EP300 (12), TET2 (12), ASXL2 (12).
Cell cycle (66): TP53 (40), CDKN2A (14), MED12 (13).
Others (68): HERC1 (25), SACS (17), USP9X (14), ASPM (13).
Heatmap of five functional categories of sequence mutations. Horizontally, genes are ordered by functional category and mutation rate. Vertically, the 1,223 BCP ALL cases are ordered consistently with the subgroups defined by the unsupervised hierarchical clustering heatmap. Mutations whose frequencies were distributed significantly differently in certain subgroups are indicated in red.
Fig. S6. Landscape of mutation type and pairwise relationship of driver gene mutations.
[Figure: panels A-D. Panel A: sample count (n) per gene, colored by mutation type (missense, frameshift, nonsense, protein deletion, protein insertion, splice site, multiple). Remaining panels: gene-by-gene co-occurrence matrices, marked at P value < 0.05 and at adjusted P value < 0.05.]
(A) Top 44 recurrently mutated genes in BCP ALL patients. (B) Statistically significant pairwise relationship for co-occurrence and mutual exclusivity in 1,223 BCP ALL. P-values were calculated using two-sided Fisher’s exact test and adjusted using FDR. Co-occurrence was colored in red (P 0.05
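The pairwise test named in this caption (two-sided Fisher's exact test followed by FDR adjustment) can be sketched in plain Python. This is an illustrative sketch, not the authors' code: the function names are hypothetical, and the Benjamini-Hochberg procedure is assumed as the FDR method because the caption does not name one.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    For a gene pair: a = cases with both genes mutated, b = only gene 1,
    c = only gene 2, d = neither. Sums the probabilities of all tables
    that are no more likely than the observed one.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x co-mutated cases.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    total = sum(prob(x) for x in range(lo, hi + 1)
                if prob(x) <= p_obs * (1 + 1e-9))
    return min(1.0, total)


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted P values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest P value down
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

In practice one P value would be computed per gene pair and the whole vector adjusted at once, so that the adjustment reflects the number of pairs tested.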
[Figure: overall survival panels with number-at-risk (number censored) tables; groups shown: PAX5/CRLF2 fusions (G9), MLL fusions (G8), ETV6-RUNX1, and BCR-ABL1.]
Five-year overall survival (OS) curves of gene fusion or mutations with ETV6-RUNX1 (G3) as low risk and BCR-ABL1 (G6) as high risk. Survival curves were estimated with the Kaplan-Meier method and compared using two-sided log-rank test.
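The Kaplan-Meier estimate named in this caption multiplies, at each event time, the fraction of at-risk patients who survive that time. A minimal sketch follows; the function name and input layout are hypothetical, and the log-rank comparison is not shown.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier survival estimate.

    times: follow-up time for each patient
    events: 1 if the patient died at that time, 0 if censored
    Returns a list of (time, survival probability) at each distinct death time.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Gather everyone whose follow-up ends at time t.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= removed  # deaths and censored cases both leave the risk set
    return curve


# Toy example with made-up follow-up times: three deaths, two censored.
curve = kaplan_meier([5, 10, 10, 20, 30], [1, 0, 1, 1, 0])
```

Censored patients lower the number at risk for later time points without producing a step in the curve, which is why the figures report both the number at risk and the number censored.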
Fig. S15. The overall survival of different subgroups based on gene fusions or mutations in adult BCP ALL.
[Figure: six panels (A-F), each plotting overall survival (%) with a number-at-risk (number censored) table and comparing one subgroup with BCR-ABL1. Annotated comparisons: MEF2D fusions (G1) vs BCR-ABL1, P > 0.05; TCF3-PBX1 (G2) vs BCR-ABL1, P > 0.05; ZNF384 fusions (G5) vs BCR-ABL1, P = 0.003; DUX4 fusions (G4) vs BCR-ABL1, P = 0.018; Hyperdiploidy (G7) vs BCR-ABL1, P < 0.001; MLL fusions (G8) vs BCR-ABL1, P > 0.05.]
Five-year overall survival (OS) curves of distinct gene fusions or mutations with BCR-ABL1 (G6) as high risk. Survival curves were estimated with the Kaplan-Meier method and compared using two-sided log-rank test.
References
1. Kim D, Langmead B, & Salzberg SL (2015) HISAT: a fast spliced aligner with low memory requirements. Nat Methods 12(4):357-360.
2. Dobin A, et al. (2013) STAR: ultrafast universal RNA-seq aligner. Bioinformatics (Oxford, England) 29(1):15-21.
3. Tyner C, et al. (2017) The UCSC Genome Browser database: 2017 update. Nucleic Acids Res 45(D1):D626-D634.
4. McKenna A, et al. (2010) The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome research 20(9):1297-1303.
5. Liu YF, et al. (2016) Genomic Profiling of Adult and Pediatric B-cell Acute Lymphoblastic Leukemia. EBioMedicine 8:173-183.
6. Sun Z, Bhagwate A, Prodduturi N, Yang P, & Kocher JA (2017) Indel detection from RNA-seq data: tool evaluation and strategies for accurate detection of actionable mutations. Brief Bioinform 18(6):973-983.
7. Wilm A, et al. (2012) LoFreq: a sequence-quality aware, ultra-sensitive variant caller for uncovering cell-population heterogeneity from high-throughput sequencing datasets. Nucleic Acids Res 40(22):11189-11201.
8. Koboldt DC, et al. (2012) VarScan 2: somatic mutation and copy number alteration discovery in cancer by exome sequencing. Genome research 22(3):568-576.
9. Wang K, Li M, & Hakonarson H (2010) ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data. Nucleic Acids Res 38(16):e164.
10. Chen B, et al. (2018) Identification of fusion genes and characterization of transcriptome features in T-cell acute lymphoblastic leukemia. Proc Natl Acad Sci U S A 115(2):373-378.
11. Lilljebjörn H, et al. (2016) Identification of ETV6-RUNX1-like and DUX4-rearranged subtypes in paediatric B-cell precursor acute lymphoblastic leukaemia. Nat Commun 7:11790.
12. Abecasis GR, et al. (2012) An integrated map of genetic variation from 1,092 human genomes. Nature 491(7422):56-65.
13. Lek M, et al. (2016) Analysis of protein-coding genetic variation in 60,706 humans. Nature 536(7616):285-291.
14. Kiran A & Baranov PV (2010) DARNED: a DAtabase of RNa EDiting in humans. Bioinformatics (Oxford, England) 26(14):1772-1776.
15. Ramaswami G & Li JB (2014) RADAR: a rigorously annotated database of A-to-I RNA editing. Nucleic Acids Res 42(Database issue):D109-113.
16. Li H, et al. (2009) The Sequence Alignment/Map format and SAMtools. Bioinformatics (Oxford, England) 25(16):2078-2079.
17. Robinson JT, et al. (2011) Integrative genomics viewer. Nature biotechnology 29(1):24-26.
18. Sherry ST, et al. (2001) dbSNP: the NCBI database of genetic variation. Nucleic Acids Res 29(1):308-311.
19. Forbes SA, et al. (2017) COSMIC: somatic cancer genetics at high-resolution. Nucleic Acids Res 45(D1):D777-D783.
20. Edgren H, et al. (2011) Identification of fusion genes in breast cancer by paired-end RNA-sequencing. Genome Biol 12(1):R6.
21. McPherson A, et al. (2011) deFuse: an algorithm for gene fusion discovery in tumor RNA-Seq data. PLoS Comput Biol 7(5):e1001138.
22. Yasuda T, et al. (2016) Recurrent DUX4 fusions in B cell acute lymphoblastic leukemia of adolescents and young adults. Nat Genet 48(5):569-574.
23. Gu Z, et al. (2016) Genomic analyses identify recurrent MEF2D fusions in acute lymphoblastic leukaemia. Nat Commun 7:13331.
24. Qian M, et al. (2017) Whole-transcriptome sequencing identifies a distinct subtype of acute lymphoblastic leukemia with predominant genomic abnormalities of EP300 and CREBBP. Genome research 27(2):185-195.
25. Roberts KG, et al. (2014) Targetable kinase-activating lesions in Ph-like acute lymphoblastic leukemia. N Engl J Med 371(11):1005-1015.
26. Anders S, Pyl PT, & Huber W (2015) HTSeq--a Python framework to work with high-throughput sequencing data. Bioinformatics (Oxford, England) 31(2):166-169.
27. Love MI, Huber W, & Anders S (2014) Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol 15(12):550.
28. Subramanian A, et al. (2005) Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci U S A 102(43):15545-15550.