Empowering Variant Interpretation with an Automated Classifier Applying the ACMG Guidelines for Interpretation of Sequence Variants
Dandan Xu1, Katrina Mitchel2, Jeremie Cohen1, David Gross1, David Caplan1, Michele Cargill2, Paul George1
Genomic Intelligence Platform
1. SolveBio (www.solvebio.com), New York, NY 2. Invitae, San Francisco, CA
ABSTRACT
Purpose: Next-generation sequencing (NGS)-based diagnostics have dramatically expanded the number of novel genetic variants needing clinical interpretation. Geneticists currently require approximately an hour to classify the clinical significance of a single variant. This analysis bottleneck is a significant contributor to the time and cost of NGS-based tests and is a long-term barrier to realizing the dream of precision medicine. The purpose of this project is to create variant classification software that automates as much of the ACMG guidelines for the interpretation of sequence variants (Richards 2015) as possible, in order to assist genomics professionals in curating and ascertaining the clinical significance of sequence variants.
Methodology: The SolveBio data platform currently indexes numerous genomic reference datasets (13 billion records from 64 continually updated and versioned datasets as of November 2015). Several of these (ExAC, ClinVar, OMIM, 1000 Genomes, certain locus-specific databases, RefSeq, dbSNP, etc.) are commonly used in the interpretation of sequence variants. For each of the 28 criteria in the ACMG guidelines, we assessed whether it was currently possible to encode the criterion computationally using existing reference datasets and an internally built variant effect predictor, and we coded the criteria for which it was. We created a test set of all ClinVar variants that had multiple submitters (with agreeing interpretations) and did not have any PubMed papers associated with the submissions. We assessed concordance between the ClinVar interpretations and the interpretations from the SolveBio automated classifier.
Results: We computationally encoded half of the 28 criteria (8/16 pathogenic and 6/12 benign) stated in the ACMG guidelines. Each computationally encoded criterion has a comprehensive test suite and returns to the user the specifics of how the criterion was met or not met (with references to the relevant supporting database entry). A total of 2,728 variants from ClinVar were automatically classified in 4,718 seconds (1.7 seconds/variant). Concordance between the ClinVar interpretation and the automated classifier was 37% for an exact match on the pathogenic, likely pathogenic, uncertain significance, likely benign, benign (P/LP/VUS/LB/B) scale, 66% on the 3-category scale (P = LP or P, etc.), and 85% within one level of difference (LB = B, LB, or VUS, etc.). Every variant that the automated classifier called pathogenic (n=50) or benign (n=492) was also pathogenic or benign, respectively, in the ClinVar set.
Conclusion: The ACMG guidelines, though not written with computational implementation in mind, are amenable to being computationally encoded. Several details and definitions within the guidelines are subject to semi-arbitrary cutoffs that may need to be adjusted on a gene- or disease-specific basis. Variant interpretation requires highly skilled and trained genomics professionals, but flexible and transparent software will greatly reduce the amount of time and effort spent on repetitive manual analysis.
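The five-tier call (P/LP/VUS/LB/B) depends on which of the 16 pathogenic and 12 benign criteria are met. The sketch below shows one way the categorical combining rules of the ACMG guidelines (Richards 2015) could be encoded. It is illustrative only and is not taken from the SolveBio classifier: the function names `strength` and `classify` are hypothetical, treating conflicting pathogenic and benign evidence as VUS is this sketch's reading of the guidelines, and the thresholds should be checked against the published table before any real use.

```python
from collections import Counter


def strength(code: str) -> str:
    """Map a criterion code such as 'PVS1' or 'BP4' to its strength prefix."""
    return code.rstrip("0123456789")


def classify(met_criteria: set) -> str:
    """Combine met ACMG criteria into P, LP, VUS, LB, or B (hypothetical helper)."""
    n = Counter(strength(code) for code in met_criteria)
    pvs, ps, pm, pp = n["PVS"], n["PS"], n["PM"], n["PP"]
    ba, bs, bp = n["BA"], n["BS"], n["BP"]

    pathogenic = (
        (pvs >= 1 and (ps >= 1 or pm >= 2 or (pm == 1 and pp == 1) or pp >= 2))
        or ps >= 2
        or (ps == 1 and (pm >= 3 or (pm == 2 and pp >= 2) or (pm == 1 and pp >= 4)))
    )
    likely_pathogenic = (
        (pvs == 1 and pm == 1)
        or (ps == 1 and 1 <= pm <= 2)
        or (ps == 1 and pp >= 2)
        or pm >= 3
        or (pm == 2 and pp >= 2)
        or (pm == 1 and pp >= 4)
    )
    benign = ba >= 1 or bs >= 2
    likely_benign = (bs == 1 and bp >= 1) or bp >= 2

    # The stronger tier wins within each direction.
    path_tier = "P" if pathogenic else "LP" if likely_pathogenic else None
    benign_tier = "B" if benign else "LB" if likely_benign else None

    # Contradictory evidence, or no tier reached, is reported as VUS.
    if path_tier and benign_tier:
        return "VUS"
    return path_tier or benign_tier or "VUS"


if __name__ == "__main__":
    print(classify({"PVS1", "PS1"}))
    print(classify({"BP1", "BP4"}))
```

Because only half of the criteria were encoded, a classifier built this way can only ever see a subset of the evidence, which is one plausible reason for calls that are more conservative than the ClinVar interpretation.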
FRAMEWORK & IMPLEMENTATION
Framework
The Classifier consists of:
- Data: Data sources and computational algorithms that are used to evaluate each rule. The data sources used include:
  - SolveBio’s effect prediction
  - ExAC
  - dbNSFP/CADD
  - ClinVar
  - CGD
- Rules: Unique, contained, individual “pieces” of evidence used in the classification.
  - Rules combine logic, sequence effect prediction, and database lookups for the variant and gene.
  - Each rule has its own test set to cover common & edge cases.
  - Possible outcomes of rules:
    - Met: software evaluated, found rule to be met.
    - Not met: software evaluated, found rule not to be met.
    - To be evaluated: software could not evaluate this rule.
  - Each rule is returned with a Message: rule- and variant-specific information about why the outcome was met or not met.
- Classifications:
  - Determine if classifications (benign, likely benign, likely pathogenic, pathogenic) are met by combining rules.
  - Can be categorical (ACMG system) and/or quantitative (point system based).
  - Determine final classification by combining classifications.
- Availability:
  - Web interface for single variant analysis
  - Web service through SolveBio API

Example rule: PS1
1. Check if variant is a missense_variant.
2. Generate list of all possible genetic variants that […] (assumption that we will not have the unlikely case of 2 genetic variants for amino acids that span an intron).
3. For each possible variant, check if it exists in ClinVar as pathogenic with 2 or more review status stars.
4. Report outcome with messages for each ClinVar record found.

Example API call output (for developers/bioinformaticians)
https://api.solvebio.com/v1/gws/classifiers/acmg?allele=A&chromosome=12&genome_build=GRCh37&start=49434074&stop=49434074
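The PS1 steps and the three rule outcomes can be sketched as follows. This is a minimal illustration under stated assumptions, not the SolveBio implementation. Step 2 is truncated in the source; the sketch assumes, following the ACMG definition of PS1 (same amino acid change as an established pathogenic variant, regardless of nucleotide change), that it enumerates every codon encoding the same alternate amino acid. The names `Outcome`, `RuleResult`, `evaluate_ps1`, and the record fields (`gene`, `protein_pos`, `alt_codon`, `significance`, `stars`, `id`) are all hypothetical, and ClinVar is stood in for by an in-memory list.

```python
from dataclasses import dataclass, field
from enum import Enum
from itertools import product


class Outcome(Enum):
    MET = "Met"
    NOT_MET = "Not met"
    TO_BE_EVALUATED = "To be evaluated"


@dataclass
class RuleResult:
    rule: str
    outcome: Outcome
    messages: list = field(default_factory=list)


# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
_AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product("TCAG", repeat=3), _AMINO)}


def evaluate_ps1(gene, protein_pos, ref_codon, alt_codon, clinvar_records):
    """Evaluate PS1 for one codon change against hypothetical ClinVar-like records."""
    ref_aa = CODON_TABLE.get(ref_codon)
    alt_aa = CODON_TABLE.get(alt_codon)
    if ref_aa is None or alt_aa is None:
        return RuleResult("PS1", Outcome.TO_BE_EVALUATED,
                          ["Could not translate the reference or alternate codon."])

    # Step 1: the variant must be a missense_variant.
    if ref_aa == alt_aa or "*" in (ref_aa, alt_aa):
        return RuleResult("PS1", Outcome.NOT_MET, ["Variant is not a missense_variant."])

    # Step 2 (assumed): every codon that encodes the same alternate amino acid.
    candidates = {codon for codon, aa in CODON_TABLE.items() if aa == alt_aa}

    # Step 3: keep pathogenic records with 2 or more review status stars.
    hits = [
        rec for rec in clinvar_records
        if rec["gene"] == gene
        and rec["protein_pos"] == protein_pos
        and rec["alt_codon"] in candidates
        and rec["significance"] == "Pathogenic"
        and rec["stars"] >= 2
    ]

    # Step 4: report the outcome with one message per supporting record.
    if hits:
        messages = [
            f"Record {rec['id']}: {ref_aa}{protein_pos}{alt_aa} via codon "
            f"{rec['alt_codon']}, pathogenic, {rec['stars']} stars"
            for rec in hits
        ]
        return RuleResult("PS1", Outcome.MET, messages)
    return RuleResult("PS1", Outcome.NOT_MET,
                      ["No pathogenic record with 2 or more stars for this amino acid change."])
```

Returning a message with every outcome mirrors the framework's requirement that each rule explain why it was met or not met. Whether the query's own codon should count as supporting evidence is a design choice the source does not specify; the sketch keeps it.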
WEB EXAMPLES
Example pathogenic variant: NM_000531.5(OTC):c.958C>T (p.Arg320Ter)
https://www.solvebio.com/variant/GRCH37-X-38271205-38271205-T
Example benign variant: NM_003482.3(KMT2D):c.7479G>T (p.Gly2493=)
https://www.solvebio.com/variant/GRCh37-12-49434074-C-A/

DISCLOSURES
XX, PG, DG, JC, MK, and DC are employees of SolveBio. KM and MC are employees of Invitae.

CONTACT
[email protected] www.solvebio.com

CONCORDANCE ANALYSIS
Input variant set from ClinVar (n=2,728):
- Multiple submitters, concordant interpretations
- No literature was attached to these ClinVar records

Rows: ClinVar interpretation. Columns: SolveBio ACMG-based Classifier.

ClinVar       P      LP     VUS      LB       B
B             0       0     258     756     492
LB            0       0      73      54       0
VUS           0       0     418     434       0
LP            0       1       9       3       0
P            50      42     125      13       0
Results:
- Evaluating the 2,728 variants took ~1.7 seconds per variant.
- There is complete concordance in 5-category classification between the ClinVar test set and the SolveBio Classifier for 37% of variants.
- Every single variant that was classified benign by the SolveBio Classifier (n=492) was benign in the ClinVar set.
- Every single variant that was classified pathogenic by the SolveBio Classifier (n=50) was pathogenic in the ClinVar set.

                       Complete       3 category concordance    Within one level concordance
                       concordance    (B=LB/B, P=LP/P)          (LB=B/LB/VUS, etc)
Number of variants     1015           1813                      2329
Percentage of total    37.21          66.46                     85.37
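The three concordance measures can be recomputed directly from the confusion matrix. The sketch below transcribes the poster's matrix (rows are ClinVar interpretations, columns are classifier calls in the order P, LP, VUS, LB, B) and counts exact matches, matches after merging P with LP and LB with B, and calls within one level of each other. The names `LEVELS`, `MATRIX`, `THREE_CATEGORY`, and `concordance` are hypothetical and chosen for illustration.

```python
LEVELS = ["P", "LP", "VUS", "LB", "B"]

# Rows: ClinVar interpretation. Values: classifier calls in LEVELS order.
MATRIX = {
    "B":   [0, 0, 258, 756, 492],
    "LB":  [0, 0, 73, 54, 0],
    "VUS": [0, 0, 418, 434, 0],
    "LP":  [0, 1, 9, 3, 0],
    "P":   [50, 42, 125, 13, 0],
}

# Merge adjacent tiers for the 3-category comparison.
THREE_CATEGORY = {"P": "P/LP", "LP": "P/LP", "VUS": "VUS", "LB": "LB/B", "B": "LB/B"}


def concordance(matrix):
    """Return (total, exact, three_category, within_one_level) variant counts."""
    total = exact = three_category = within_one = 0
    for clinvar_call, row in matrix.items():
        for classifier_call, count in zip(LEVELS, row):
            distance = abs(LEVELS.index(clinvar_call) - LEVELS.index(classifier_call))
            total += count
            if distance == 0:
                exact += count
            if distance <= 1:
                within_one += count
            if THREE_CATEGORY[clinvar_call] == THREE_CATEGORY[classifier_call]:
                three_category += count
    return total, exact, three_category, within_one


if __name__ == "__main__":
    total, exact, three_category, within_one = concordance(MATRIX)
    for label, value in [("exact", exact), ("3 category", three_category),
                         ("within one level", within_one)]:
        print(f"{label}: {value} of {total} ({100 * value / total:.2f}%)")
```

Run on the matrix above, the counts come out to 1,015, 1,813, and 2,329 of 2,728 variants, matching the reported summary.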
USE CASES
• As part of a bioinformatics pipeline to annotate & filter variants ascertained from NGS-based sequencing
• As part of a bioinformatics pipeline to prioritize variants for manual analysis
• As part of a comprehensive single variant assessment
• Sanity checking / double checking of previously assessed variants or a variant warehouse
• Automated surveillance of whether or not a previously classified variant may change with new data
• Systematic analysis of the effects of differing SOPs and quantitative thresholds on variant classification
• Allows labs to compare their classification SOPs to the standard guidelines both in bulk and with single variants