Parsing, standardization, documentation, and presentation of raw ClinVar records for programmatic web access

Xing (Dandan) Xu, Paul George, David Gross, Jérémie Cohen, Mark Kaganovich, David Caplan
SolveBio (www.solvebio.com), New York, NY

SUMMARY

RAW XML

PARSED OUTPUT

This sample variant record is based on 4 different RCV accessions combined into one web page.

• ClinVar, the largest freely available database of clinical significance classifications, is an integral part of every workflow that requires clinical interpretation of sequence variants.
• Variant annotation, filtering, and ranking require automation and programmatic access to ClinVar records.
• Here we present the lessons we learned from importing, documenting, and presenting ClinVar data for programmatic access.
• Parsing ClinVar requires more lines of code than any other dataset in SolveBio’s system, even though ClinVar is one of the smaller datasets.

CONTACT [email protected]

ClinicalSignificance/Description

MeasureSet/Name/


Browser-based filtering and querying.

ClinicalSignificance/ReviewStatus

MeasureSet/Measure/@Type

MeasureSet/Measure/CytogeneticLocation

TraitSet[@Type="Disease"]/Trait/Name

MeasureSet/Measure/SequenceLocation

MeasureSet/Measure/Name/ElementValue[@Type="Alternate"]

MeasureSet/Measure/MeasureRelationship[@Type="variant in gene"]

MeasureSet/Measure/AttributeSet/Attribute[@Type="HGVS"]

MeasureSet/Measure/XRef/@ID

MeasureSet/Measure/AttributeSet/Attribute[@Type="MolecularConsequence"]

ClinVarAssertion/ClinicalSignificance/Description *

ClinVarAssertion/ClinVarAccession/@Acc

* Specifics of OMIM submissions often have different XML paths from the other submission/SCV records.
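As a rough illustration of how XML paths like those labeled above can be applied, the sketch below uses Python's standard xml.etree.ElementTree on a small, hand-written fragment. The fragment is hypothetical and heavily simplified (it is not a real ClinVar record), the function name extract_fields is ours, and the Trait/Name/ElementValue nesting is an assumption. ElementTree supports only a subset of XPath, so attribute values such as @Type are read with .get() instead of a path.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified fragment for illustration only; real ClinVar
# records are far larger and their structure varies by submitter.
SAMPLE_XML = """
<ReferenceClinVarAssertion>
  <ClinicalSignificance>
    <ReviewStatus>classified by single submitter</ReviewStatus>
    <Description>Pathogenic</Description>
  </ClinicalSignificance>
  <MeasureSet Type="Variant">
    <Measure Type="single nucleotide variant">
      <AttributeSet>
        <Attribute Type="HGVS">NC_000007.13:g.117171029G&gt;A</Attribute>
      </AttributeSet>
      <CytogeneticLocation>7q31.2</CytogeneticLocation>
    </Measure>
  </MeasureSet>
  <TraitSet Type="Disease">
    <Trait Type="Disease">
      <Name><ElementValue Type="Preferred">Cystic fibrosis</ElementValue></Name>
    </Trait>
  </TraitSet>
</ReferenceClinVarAssertion>
"""


def extract_fields(xml_text):
    """Pull a few fields out of one (simplified) record using path lookups."""
    root = ET.fromstring(xml_text)
    measure = root.find("MeasureSet/Measure")
    hgvs_path = "MeasureSet/Measure/AttributeSet/Attribute[@Type='HGVS']"
    return {
        "clinical_significance": root.findtext("ClinicalSignificance/Description"),
        "review_status": root.findtext("ClinicalSignificance/ReviewStatus"),
        # ElementTree cannot select attributes in a path, so use .get().
        "variant_type": measure.get("Type") if measure is not None else None,
        "cytogenetic_location": root.findtext("MeasureSet/Measure/CytogeneticLocation"),
        "hgvs": [attr.text for attr in root.findall(hgvs_path)],
        # Assumes the disease name sits in an ElementValue child of Name.
        "disease": root.findtext("TraitSet[@Type='Disease']/Trait/Name/ElementValue"),
    }


fields = extract_fields(SAMPLE_XML)
print(fields)
```

Missing elements come back as None (or an empty list for hgvs), which is one simple way to tolerate records, such as OMIM submissions, whose paths differ.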

Graphical variant-specific interface.

Comprehensive documentation. 987 lines of ClinVar-specific code …

RAW VCF

This sample line in the ClinVar VCF file comprises 5 distinct ClinVar records (RCV accessions).

7 117171029 rs78655421 G A,C,T . . RS=78655421;RSPOS=117171029;dbSNPBuildID=131;SSR=0;SAO=1;VP=0x050060000a05040402110104;GENEINFO=CFTR:1080;WGT=1;VC=SNV;PM;NSM;REF;ASP;VLD;HD;LSD;OM;NOV;CLNALLE=1,2,3;CLNHGVS=NC_000007.13:g.117171029G>A,NC_000007.13:g.117171029G>C,NC_000007.13:g.117171029G>T;CLNSRC=HGMD|OMIM_Allelic_Variant,.,.;CLNORIGIN=1,1,1;CLNSRCID=CM900043|602421.0005,.,.;CLNSIG=5|5|5,1,1;CLNDSDB=GeneReviews:MedGen:OMIM:Orphanet:SNOMED_CT|GeneReviews:MedGen:OMIM:Orphanet|MedGen,GeneReviews:MedGen:OMIM:Orphanet:SNOMED_CT,GeneReviews:MedGen:OMIM:Orphanet:SNOMED_CT;CLNDSDBID=NBK1250:C0010674:219700:ORPHA586:190905008|NBK1250:CN032726:277180:ORPHA48|CN221809,NBK1250:C0010674:219700:ORPHA586:190905008,NBK1250:C0010674:219700:ORPHA586:190905008;CLNDBN=Cystic_fibrosis|Congenital_bilateral_absence_of_the_vas_deferens|not_provided,Cystic_fibrosis,Cystic_fibrosis;CLNREVSTAT=prof|single|single,not,not;CLNACC=RCV000007528.5|RCV000007529.1|RCV000078997.2,RCV000046918.2,RCV000046919.2

To parse the INFO fields:
1. RS, RSPOS, dbSNPBuildID, SSR, SAO, VP, GENEINFO, WGT, VC, PM, NSM, REF, ASP, VLD, HD, LSD, OM, and NOV are not ClinVar record specific and apply to all 5 RCV records.
2. CLNALLE, CLNHGVS, and CLNORIGIN are comma-delimited and refer to a specific allele at this locus (0 being the reference, -1 being unknown, 1 through N being the alternate alleles, in order).
3. CLNSIG, CLNDBN, CLNREVSTAT, and CLNACC are comma-delimited by specific allele and pipe-delimited by RCV accession within a specific allele.
4. CLNDSDB and CLNDSDBID are comma-delimited by allele, pipe-delimited by RCV accession, and colon-delimited by Source Database.
5. CLNSRC and CLNSRCID are comma-delimited by allele, pipe-delimited by RCV accession, and SEPARATELY colon-delimited by Clinical Source.

Other notes:
- Genomic coordinates can differ from those of the exact same record in XML format (5’ shuffled versus 3’).
- Multiple copies of a record are possible, each with a different rs ID.
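The delimiter rules for the INFO fields can be sketched in a few lines of Python. This is a minimal illustration, not SolveBio's parser: the function names (parse_info, resolve_allele, split_clinvar_records) and the output keys are our own, only the ClinVar-specific keys from the sample line are included, and CLNSRC/CLNSRCID are deliberately left out because their per-source delimiting does not line up one-to-one with RCV accessions.

```python
# ClinVar-specific part of the sample INFO column (rule 1 fields omitted).
SAMPLE_INFO = (
    "CLNALLE=1,2,3;"
    "CLNHGVS=NC_000007.13:g.117171029G>A,NC_000007.13:g.117171029G>C,"
    "NC_000007.13:g.117171029G>T;"
    "CLNORIGIN=1,1,1;CLNSIG=5|5|5,1,1;"
    "CLNDSDB=GeneReviews:MedGen:OMIM:Orphanet:SNOMED_CT|"
    "GeneReviews:MedGen:OMIM:Orphanet|MedGen,"
    "GeneReviews:MedGen:OMIM:Orphanet:SNOMED_CT,"
    "GeneReviews:MedGen:OMIM:Orphanet:SNOMED_CT;"
    "CLNDSDBID=NBK1250:C0010674:219700:ORPHA586:190905008|"
    "NBK1250:CN032726:277180:ORPHA48|CN221809,"
    "NBK1250:C0010674:219700:ORPHA586:190905008,"
    "NBK1250:C0010674:219700:ORPHA586:190905008;"
    "CLNDBN=Cystic_fibrosis|Congenital_bilateral_absence_of_the_vas_deferens|"
    "not_provided,Cystic_fibrosis,Cystic_fibrosis;"
    "CLNREVSTAT=prof|single|single,not,not;"
    "CLNACC=RCV000007528.5|RCV000007529.1|RCV000078997.2,"
    "RCV000046918.2,RCV000046919.2"
)

ALLELE_KEYS = ("CLNHGVS", "CLNORIGIN")                   # rule 2
RCV_KEYS = ("CLNSIG", "CLNDBN", "CLNREVSTAT", "CLNACC")  # rule 3
SOURCE_DB_KEYS = ("CLNDSDB", "CLNDSDBID")                # rule 4


def parse_info(info):
    """Split an INFO string into a dict; flags without '=' map to True."""
    fields = {}
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        fields[key] = value if sep else True
    return fields


def resolve_allele(index, ref, alts):
    """Map a CLNALLE index to a base: 0 = REF, -1 = unknown, 1..N = ALT."""
    if index == 0:
        return ref
    if index == -1:
        return None
    return alts[index - 1]


def split_clinvar_records(ref, alt, info):
    """Return one dict per RCV accession found in a single VCF line."""
    fields = parse_info(info)
    alts = alt.split(",")
    keys = ALLELE_KEYS + RCV_KEYS + SOURCE_DB_KEYS
    by_allele = {key: fields[key].split(",") for key in keys}
    records = []
    for i, allele_index in enumerate(fields["CLNALLE"].split(",")):
        per_rcv = {key: by_allele[key][i].split("|") for key in RCV_KEYS}
        db_names = by_allele["CLNDSDB"][i].split("|")
        db_ids = by_allele["CLNDSDBID"][i].split("|")
        for j, accession in enumerate(per_rcv["CLNACC"]):
            records.append({
                "accession": accession,
                "allele": resolve_allele(int(allele_index), ref, alts),
                "hgvs": by_allele["CLNHGVS"][i],
                "origin": by_allele["CLNORIGIN"][i],
                "significance": per_rcv["CLNSIG"][j],
                "disease": per_rcv["CLNDBN"][j],
                "review_status": per_rcv["CLNREVSTAT"][j],
                # Pair each source database with its identifier.
                "source_ids": dict(zip(db_names[j].split(":"),
                                       db_ids[j].split(":"))),
            })
    return records


for record in split_clinvar_records("G", "A,C,T", SAMPLE_INFO):
    print(record["accession"], record["allele"], record["disease"])
```

Splitting by comma first, then by pipe, then by colon mirrors the allele, RCV accession, and source database nesting described in the rules, and yields the 5 RCV records for the sample line.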

Financial & competing interests disclosure: XX, PG, DG, JC, MK, and DC are employees of SolveBio. The authors have no other relevant affiliations or financial involvement with any organization or entity with a financial interest in or financial conflict with the subject matter or materials discussed in the manuscript apart from those disclosed.

Data infrastructure for genomics.

Programmatic access via API and language-specific clients (Python, Ruby, JavaScript, R).