Improving Pathogenicity Prediction of Structural Variation in Neurodevelopmental Disorders: A Machine Learning Approach

Prateek Tandon, Omar Shanta, Danny Antaki, William Brandler, Morgan Kleiber, Oanh Hong, Jonathan Sebat
Sebat Lab, Psychiatry Department (University of California, San Diego)

Fusing Multiple Biological Functional Annotations into a Robust Pathogenicity Score for Structural Variant Mutations

Motivation

§ Structural Variants (SVs) are genetic variants longer than 50 bp, including deletions, duplications, complex rearrangements, and mobile elements.
§ Gene-disrupting SVs elevate risk for neurodevelopmental disorders, including Autism Spectrum Disorders (ASD) and Schizophrenia (SCZ) [8, 10].
§ Current analytical approaches are limited to gene burden, enrichment analysis of large variants, and coding regions.

Goal: Incorporate biological functional information into predicting the pathogenicity of structural variation.

Model Performance

§ Random Forest (RF) classifier trained and tested on other SV data sets, separately for deletions and duplications.
§ Performance optimized over two RF parameters: the number of trees and the maximum depth of any given tree.
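The parameter search over number of trees and maximum tree depth can be sketched as a small grid search. This is a minimal illustration, not the poster's pipeline: it assumes scikit-learn, uses a synthetic feature matrix in place of the annotated SVs, and picks ROC AUC with 3-fold cross-validation as the selection criterion, none of which is stated in the poster. In practice one search would be run for deletions and another for duplications.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the annotated SV feature matrix (NOT the poster's data).
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=5, random_state=0
)

# Candidate values for the two tuned parameters (grid values are illustrative).
param_grid = {"n_estimators": [10, 25, 50], "max_depth": [5, 10, 15]}

# Assumption: model selection by cross-validated ROC AUC.
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    scoring="roc_auc",
    cv=3,
)
search.fit(X, y)

best_params = search.best_params_
# Probability of the positive (here: "pathogenic") class for each sample.
scores = search.best_estimator_.predict_proba(X)[:, 1]
print(best_params, round(search.best_score_, 3))
```

The second column of `predict_proba` plays the role of the classifier score: the forest's estimated probability that a variant belongs to the pathogenic class.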

Best deletions classifier: NumTrees=25, MaxDepth=10, NumFeatures=1005
Best duplications classifier: NumTrees=50, MaxDepth=15, NumFeatures=103

SV Functional Annotations

Feature Category: Simple Information
• SV Length
• Number of Gene Intersections
• Type: Deletion or Duplication
• OMIM: Autosomal Dominant / Autosomal Recessive

Feature Category: Functional Gene and Genic Element Disruption
• Gene sets: refGene, fetal brain development genes
• Gene functional annotations (i.e. exons, introns, UTR3, UTR5, upstream, downstream)
• Distances to nearest gene / functional element

Feature Category: Measures of Functional Constraint
• ExAC probability of loss-of-function intolerance (pLI)
• Max pLI of disrupted functional elements
• Sum pLI of disrupted functional elements
• CADD Single Nucleotide Polymorphism (SNP) Scores

Classifier Features Driving Accuracy for Deletions

Feature and Rank in Classifier (deletions):
• Rank 1: # of intersections with disrupted gene exons from refGene
• Ranks 2-8: ExAC p_syn, p_mis, p_lof, sumPLI, maxPLI, pRecessive, pNull of disrupted refGene exons (p-value features = min, pLI features = max)
• Ranks 9-12: Min Samocha P-Value (computed from all validation cases) of disrupted exons
• Ranks 13-16: # of intersections, max/sum fold change, minP of disrupted Fetal Brain Expression gene exons
• Ranks 19-20: # of intersections, sum fold change of disrupted Fetal Brain Expression genes
• Ranks 21-24: sumPLI, maxPLI, pRecessive, pLoF of Fetal Brain Promoters
• Ranks 25-27: # of intersections, sum / log fold change of disrupted Brain Expression upstream regions

Classifier Features Driving Accuracy for Duplications

Feature and Rank in Classifier (duplications):
• Ranks 1-2: maxPLI, sumPLI of refGene disrupted exons
• Ranks 3-10: % overlap features, max fold change, minP with Brain Expression gene exons
• Ranks 11-15: ExAC pSyn, pMis, pLof, pRecessive, pNull of refGene exons
• Ranks 16-17: Distance to nearest UTR5 refGene region, upstream refGene region
• Rank 18: Distance to nearest Fetal Brain Promoter region
• Ranks 19-20: Distance to nearest downstream refGene region, UTR3 refGene region
• Ranks 21-25: Distance to nearest UTR5, upstream, UTR3, downstream region of Brain Expression Genes and CNV intolerance

§ The RF learns to rank the functional importance of mutated elements when scoring the pathogenicity of an SV.
§ Higher predictive scores are given to SVs that disrupt exons, especially exons of genes crucial to fetal brain development.
§ Predictions are driven by coding regions; non-coding elements have a secondary effect.

SV Functional Annotations (continued)

Feature Category: Orthogonal Gene Disease Association Evidence
• ASD Known Genes
• SFARI Genes
• Association evidence from exome studies

Feature Category: Non-Coding Features
• Fetal Brain Enhancers
• Fetal Brain Promoters
• Chromatin States / Marks
• DNase I Hypersensitivity
• Regulatory Motifs

Total Number of Features: 1400
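A few of the simpler annotations (SV length, number of gene intersections, max and sum pLI of disrupted genes, distance to the nearest gene) can be computed with plain interval arithmetic. The sketch below is illustrative only: the `Interval`, `Gene`, and `annotate` names, the half-open coordinate convention, and the toy genes with their pLI values are all assumptions, not taken from the poster.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Interval:
    chrom: str
    start: int  # 0-based, inclusive (assumed convention)
    end: int    # exclusive

@dataclass(frozen=True)
class Gene:
    name: str
    region: Interval
    pli: float  # constraint score in [0, 1]

def overlaps(a, b):
    """True if two half-open intervals share at least one base."""
    return a.chrom == b.chrom and a.start < b.end and b.start < a.end

def distance(a, b):
    """Gap in bp between two intervals; 0 if they touch or overlap, None across chromosomes."""
    if a.chrom != b.chrom:
        return None
    return max(0, a.start - b.end, b.start - a.end)

def annotate(sv, genes):
    """Compute a handful of per-SV features from a gene list."""
    hit = [g for g in genes if overlaps(sv, g.region)]
    dists = [d for g in genes if (d := distance(sv, g.region)) is not None]
    return {
        "sv_length": sv.end - sv.start,
        "n_gene_intersections": len(hit),
        "max_pli": max((g.pli for g in hit), default=0.0),
        "sum_pli": sum(g.pli for g in hit),
        "dist_nearest_gene": min(dists, default=None),
    }

# Toy genes with made-up coordinates and pLI values.
genes = [
    Gene("GENE_A", Interval("chr1", 1_000, 5_000), 0.99),
    Gene("GENE_B", Interval("chr1", 4_000, 9_000), 0.10),
    Gene("GENE_C", Interval("chr1", 20_000, 30_000), 0.50),
]

genic = annotate(Interval("chr1", 3_000, 6_000), genes)
intergenic = annotate(Interval("chr1", 10_000, 12_000), genes)
print(genic)
print(intergenic)
```

The same pattern extends to exons, promoters, or enhancers by swapping the gene list for another set of annotated intervals.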

Pathogenicity Model

§ CADD-like [5] training pattern in which the classifier delineates common, benign variation from random variation.
o Non-Pathogenic Class Data: Benign variation
o Pathogenic Class Data: Random variation (mutations shuffled on the same chromosome, preserving the same SV length distribution)

Key idea: Variation depleted in the non-pathogenic class is expected to be under evolutionary selection.

[Figure: Benign Variation vs. Random Variation]
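One way to build the random-variation class is to move each benign SV to a uniformly random position on its own chromosome while keeping its length, which preserves both the per-chromosome counts and the SV length distribution. The following is a minimal sketch under that assumption; the `SV` type, the `shuffle_svs` function, the optional `allowed` filter, and the toy chromosome sizes are hypothetical and not the poster's actual code.

```python
import random
from typing import NamedTuple

class SV(NamedTuple):
    chrom: str
    start: int  # 0-based, inclusive (assumed convention)
    end: int    # exclusive

def shuffle_svs(svs, chrom_sizes, allowed=lambda sv: True, seed=0, max_tries=1000):
    """Relocate each SV on its own chromosome, keeping its length.

    `allowed` is an optional filter on candidate placements (for example,
    rejecting placements that overlap exons); by default every placement
    is accepted.
    """
    rng = random.Random(seed)
    shuffled = []
    for sv in svs:
        length = sv.end - sv.start
        max_start = chrom_sizes[sv.chrom] - length
        if max_start < 0:
            raise ValueError(f"{sv} is longer than {sv.chrom}")
        for _ in range(max_tries):
            start = rng.randint(0, max_start)  # inclusive on both ends
            candidate = SV(sv.chrom, start, start + length)
            if allowed(candidate):
                shuffled.append(candidate)
                break
        else:
            raise RuntimeError(f"no allowed placement found for {sv}")
    return shuffled

# Toy chromosome sizes and SVs (illustrative values, not real genome data).
TOY_CHROM_SIZES = {"chr1": 1_000_000, "chr2": 500_000}
benign = [
    SV("chr1", 10_000, 12_500),
    SV("chr2", 40_000, 40_300),
    SV("chr1", 700_000, 760_000),
]
random_class = shuffle_svs(benign, TOY_CHROM_SIZES, seed=42)
print(random_class)
```

Because the shuffled set mirrors the benign set in size and length, a classifier trained to separate the two must rely on where variants fall relative to functional annotations rather than on how large they are.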

Performance on Data Sets

Metrics:
§ # Genes disrupted: Number of genes intersecting the body of the SV.
§ Max pLI: Maximum ExAC pLI [6] score of any disrupted gene.
§ Sum pLI: Sum of pLI scores of all genes disrupted by the SV.
§ SVScore [4]: Sum of CADD scores of base pairs disrupted by the SV.
§ Classifier Score: Posterior probability from our random forest pathogenicity predictor.

Classifier Score Performance:
§ ClinVar CNV Data: The ExAC pLI score typically exhibits a bimodal distribution. Our classifier score can be accurate even in the absence of a highly constrained gene (max pLI < 0.9).
§ Psychiatric Genomics Consortium (PGC) Schizophrenia (SCZ) CNV Data: After filtering for known loci, our classifier outperforms other predictors and univariate measures for medium to large variants.
§ Sanders Simons Simplex Collection (SSC) De Novo Mutation Data: Classifier outperforms other metrics on non-exonic SVs (i.e. SVs that do not overlap exons).

Noncoding Classifier

§ Challenge: Performance needs improvement (results are not statistically significant for non-coding variants).
§ Solution: Train an explicit non-coding pathogenicity predictor.
§ Train Set: SSC
o Non-Pathogenic Class = 25,269 non-exonic, non-private deletions and 2,160 non-exonic, non-private duplications.
o Pathogenic Class = An equivalent number of mutations shuffled within non-exonic regions, preserving chromosome and SV length distribution.
§ Test Set: PGC
o 27,090 non-exonic deletions
o 16,452 non-exonic duplications
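Selecting non-exonic, non-private SVs for a non-coding training set amounts to two filters: drop any SV that overlaps an exon, and drop any SV seen in too few carriers. The sketch below assumes that "non-private" means observed in at least two carriers, and uses hypothetical field names (`n_carriers`), half-open coordinates, and toy data; none of these details come from the poster.

```python
def overlaps_exon(sv, exons):
    """True if the SV shares at least one base with any exon (half-open coordinates)."""
    return any(
        chrom == sv["chrom"] and sv["start"] < exon_end and exon_start < sv["end"]
        for chrom, exon_start, exon_end in exons
    )

def nonexonic_nonprivate(svs, exons, min_carriers=2):
    """Keep SVs that avoid all exons and are carried by at least `min_carriers` samples."""
    return [
        sv for sv in svs
        if sv["n_carriers"] >= min_carriers and not overlaps_exon(sv, exons)
    ]

# Toy exons as (chrom, start, end) and toy SV calls.
exons = [("chr1", 1_000, 1_200), ("chr1", 5_000, 5_400)]
svs = [
    {"id": "del1", "chrom": "chr1", "start": 2_000, "end": 3_000, "n_carriers": 5},
    {"id": "del2", "chrom": "chr1", "start": 1_100, "end": 2_500, "n_carriers": 7},  # hits an exon
    {"id": "del3", "chrom": "chr1", "start": 6_000, "end": 6_500, "n_carriers": 1},  # private
    {"id": "del4", "chrom": "chr2", "start": 1_050, "end": 1_150, "n_carriers": 3},
]

kept = nonexonic_nonprivate(svs, exons)
print([sv["id"] for sv in kept])  # ['del1', 'del4']
```

The same `overlaps_exon` predicate could also serve as a rejection test when placing shuffled variants, so that the pathogenic class stays within non-exonic regions as well.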

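P-values by classifier and SV length covariate:

Old Classifier
• No SV Length Covariate: p=0.017875 (both coding and noncoding), p=0.39639 (coding variants only), p=0.96787 (noncoding variants only)
• Controlling for SV Length: p=0.081959 (both coding and noncoding), p=0.26001 (coding variants only), p=0.73461 (noncoding variants only)

Coding / Noncoding Classifier
• No SV Length Covariate: p=0.0012844 (both coding and noncoding), p=0.94035 (coding variants only), p=0.00014476 (noncoding variants only)
• Controlling for SV Length: p=0.0011931 (both coding and noncoding), p=0.96291 (coding variants only), p=0.00014934 (noncoding variants only)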

Conclusion

§ Multiple functional annotations are fused to score the pathogenicity of structural variants for neurodevelopmental disorders.
§ The trained random forest performs well on multiple test sets and can facilitate the characterization of previously undetected CNV risk.
§ Training an explicit non-coding classifier can improve performance on variants that overlap cis-regulatory elements.

References

[1] Brandler, W. M., Antaki, D., Gujral, M., Noor, A., Rosanio, G., Chapman, T. R., ... & Wong, L. C. (2016). Frequency and complexity of de novo structural mutation in autism. The American Journal of Human Genetics, 98(4), 667-679.
[2] Brandler, W. M., Antaki, D., Gujral, M., Kleiber, M. L., Maile, M. S., Hong, O., ... & Tang, S. C. (2017). Paternally inherited noncoding structural variants contribute to autism. bioRxiv, 102327.
[3] Fischbach, G. D., & Lord, C. (2010). The Simons Simplex Collection: a resource for identification of autism genetic risk factors. Neuron, 68(2), 192-195.
[4] Ganel, L., Abel, H. J., FinMetSeq Consortium, & Hall, I. M. (2017). SVScore: an impact prediction tool for structural variation. Bioinformatics, 33(7), 1083-1085.
[5] Kircher, M., Witten, D. M., Jain, P., O’Roak, B. J., Cooper, G. M., & Shendure, J. (2014). A general framework for estimating the relative pathogenicity of human genetic variants. Nature Genetics, 46(3), 310.
[6] Lek, M., Karczewski, K., Minikel, E., Samocha, K., Banks, E., Fennell, T., ... & Tukiainen, T. (2016). Analysis of protein-coding genetic variation in 60,706 humans. bioRxiv, 030338.
[7] Miller, J. A., Ding, S. L., Sunkin, S. M., Smith, K. A., Ng, L., Szafer, A., ... & Arnold, J. M. (2014). Transcriptional landscape of the prenatal human brain. Nature, 508(7495), 199-206.
[8] Ripke, S., Neale, B. M., Corvin, A., Walters, J. T., Farh, K. H., Holmans, P. A., ... & Pers, T. H. (2014). Biological insights from 108 schizophrenia-associated genetic loci. Nature, 511(7510), 421.
[9] Samocha, K. E., Robinson, E. B., Sanders, S. J., Stevens, C., Sabo, A., McGrath, L. M., ... & Wall, D. P. (2014). A framework for the interpretation of de novo mutation in human disease. Nature Genetics, 46(9), 944-950.
[10] Sanders, S. J., Ercan-Sencicek, A. G., Hus, V., Luo, R., Murtha, M. T., Moreno-De-Luca, D., ... & Mason, C. E. (2011). Multiple recurrent de novo CNVs, including duplications of the 7q11.23 Williams syndrome region, are strongly associated with autism. Neuron, 70(5), 863-885.

Acknowledgements: This work was supported by a grant from the Simons Foundation (#275724, Sebat). Research reported in this poster was supported by the National Institute of Mental Health (NIMH) of the National Institutes of Health under award numbers [R01 MH076431] and [U01 MH109501]. This study utilized CNV datasets from ClinVar (https://www.ncbi.nlm.nih.gov/clinvar/) and the Psychiatric Genomics Consortium (PMID: 27869829).
