Finding SNPs in a Sequence Using BLAST

47 downloads 1086 Views 3MB Size Report
Sep 2, 2014 - sponding alignment (B). A mismatch in the align- ment shows the pres- ence of a potential varia- tion (insert). Clicking the. Graphics link (C) dis-.
Finding SNPs in a Sequence Using BLAST Exploring the Sequence Viewer display of BLAST result for SNP discovery http://blast.ncbi.nlm.nih.gov & http://www.ncbi.nlm.nih.gov National Center for Biotechnology Information • National Library of Medicine • National Institutes of Health • Department of Health and Human Services

Background Extensive nucleotide variation data have accumulated from detailed genetic studies as well as large-scale genome analyses. Identified short nucleotide variations (SNVs), with single nucleotide polymorphism (SNPs) as the most common type, are commonly represented by variation alleles along with some amount of flanking sequences. BLAST alignment of these flanking sequences against the genome assembly is one of the method to map those SNVs on to existing genes and specific location(s) on the genome. The NCBI repository for these variations is the Short Nucleotide Variation database (known as dbSNP), which can be accessed through the web from its homepage (www.ncbi.nlm.nih.gov/snp). NCBI used to provide a BLAST service to search against the flanking sequences of existing SNVs. Using this service in variation identification was cumbersome for several reasons:  Variations identified using new technologies did not always provide their true flanking sequences.  Variations identified from array experiments often had very short flanking sequences which could map to multiple genomic locations.  Long flanking sequences for variations which have been identified in sequencing experiments often had ambiguous base calls making mapping difficult.  Insertion and deletion alleles were generally represented by an N at the FASTA sequence level. This made searching difficult for proper interpretation of the BLAST results required referencing the allele in the FASTA definition line.  Lastly, this approach did not provide the important genomic context and coordinates of identified variations. As genome assemblies and genome mapping of variations becomes more stable, directly searching with a query against the genome assemblies using the NCBI BLAST service becomes more useful. In addition, the embedded link to Graphical Sequence Viewer (SV) [1], available from the BLAST report [2], enables the graphical examination of query alignments to the genome.

Practical applications Researchers often attempt to identify variations in specific genes or regions of the genome. In silico analysis of specific sequences by alignment to those in public collections can identify patterns of mismatches indicative of a variation. Mapping variations to existing records in dbSNP can help connect them to functional analyses reported in literature and facilitate decision-making processes to enhance the speed and productivity of research. For example, two query sequences, a genomic sequence fragment GQ892012.1 and a cDNA clone CK130262.1, are related to the human IL4 gene. According to published articles (http://1.usa.gov/WK7Uth), they may contain variations that have been shown to affect the disease prognosis. The following sections of this handout will demonstrate how to identify sequence variations contained in these two sequence records and map them to highly informative records in dbSNP.

Setting up the BLAST searches Selection of BLAST algorithm To identify dbSNP records that map to the query sequence, select “nucleotide blast” from the “Basic BLAST” section on the BLAST homepage (http://blast.ncbi.nlm.nih.gov). Query input The BLAST search pages support queries in raw sequence, sequences in FASTA format, or identifiers of sequence (gis or accessiona). Accession numbers GQ892012.1 or CK130262.1 will be used in the Query box in this case. Database setup Different types of input sequences require different databases. With genomic DNA sequences as the input query, select “NCBI genomes (chromosome)” as the target database. To search with expressed sequences, “Reference RNA sequences (refseq_rna)” should be selected instead. The database sequences searched can be restricted to a particular organism using the Organism input box. This limit simplifies the subsequent analysis. Clicking the “BLAST” button submits the search to the server. The web page automatically checks for results while the search is running, and renders the results in the browser window when the search is completed. For convenience, pre-configured BLAST search pages for a human cDNA and genomic query are listed below: http://1.usa.gov/Vxl1O2 genomic query (GQ902012.1), and http://1.usa.gov/X79h6g cDNA query (CK130262.1) NCBI Handout Series | Finding SNPs Through BLAST | Last Update September 2, 2014

Contact: [email protected]  

Page 2

Finding SNPs Through BLAST

Examination of the BLAST result The genomic BLAST search result for accession GQ892012.1 (preserved under RID 0C355PKW015 ) is shown at right. The top hit shown in the Descriptions table is to a region on the GRCh38 Primary Assembly (A). Clicking on the title of that entry automatically scrolls the display to the corresponding alignment (B).

[Graphical summary trimmed for clarify.]

A mismatch in the align- A ment shows the presence of a potential variation (insert). Clicking the Graphics link (C) displays the alignment interactively in SV to allow the visualization of annotated positions corresponding to existing NCBI records, Displayed below, the SV shows the alignment between the query GQ892012.1 and a region from the GRCh38 Primary Assembly where mismatches are marked with red ticks (D). To see variations annotated in this region, customize the display using the Configure dialog box (E). Clicking the Configure button (F) activates the selected settings and enables the addition of the SNP and Cited Variants tracks. B

C

F

E

D Contact: [email protected]

NCBI Handout Series | Finding SNPs Through BLAST | Last Update September 2, 2014

Page 3 

Finding SNPs Through BLAST

Visualizing SNP-related tracks for a genome sequence The addition of SNP and Cited Variants tracks to the graphical display (below) shows all mapped variations (A) and those with links to PubMed citations (B), respectively. All three mismatches in this alignment appear to correlate with annotated variations that are referenced in publications (C). To view these regions at the sequence level, a narrow region can be selected by clicking and dragging on the ruler around the annotated variation (D) and then zooming in using the “Zoom To Sequence” menu (E). D

A B C

F

E

H

G

In the sequence-level view (F), mouseover a variation activates a pop-up box that provides more details about the variation (G). Mouseover the numbered icon (H) in the Cited Variants track displays a collection of PubMed records citing this variation. Each PubMed id links to corresponding abstract in PubMed. Descriptions for available tracks under the Variation category are summarized in Table 1 below. Table 1. SNP-related annotation tracks Track Name

Track Content

SNP

Variaions mapped to genome with unique or mul ple loca ons. 

Suspect

A subset of varia ons with ques onable mapping due to presence of paralog and/or pseudogene 

SomaƟc Allele

Varia ons with soma c origin

GMAF >= 0.01

Varia ons with global minor allele frequency >= 0.01, thus more likely to represent common variants 

AssociaƟon Results

Varia ons from Genome‐wide associa on analyses complied by NHGRI Catalog and/or from associa on  results submi ed to dbGaP 

Cited Variants

Varia ons cited using dbSNP accessions (ss or rs) in one or more PubMed ar cles (GRCh37.1) 

ClinVar

Varia ons with phenotype submi ed to the ClinVar database  

NCBI Handout Series | Finding SNPs Through BLAST | Last Update September 2, 2014

Contact: [email protected]  

Page 4

Finding SNPs Through BLAST

SNP mapping to RefSeq mRNAs The RefSeq mRNA BLAST search result for CK130262.1 (preserved as RID 0C54VWTB015) is given to the right. The top hit shown in the Descriptions table is the A IL4 transcript variant NM_000589.3 (A). Clicking on the record title automatically scrolls the page to display the alignment (B) where three mismatches are present (circled). Clicking the Graphics link (C) will display the alignment in SV to allow interactive examination of variations in the query under the context of NCBI annotations.

B C

Customizing the display around the first mismatch by zooming to the sequence level (D) and adding the SNP track (E), it is clear that the first mismatch corresponds with the existing variation record rs2070874 (F). To better identify the variations using the naming convention from the Human Genome Variation Society (HGVS), the numbering of the mRNA’s coding region can be reset to begin at the open reading frame (ATG’s A position) by right-clicking and selecting “Set Sequence Origin At Position” (G). The position of the variant is 33 bases upstream from the start of the coding start, thus this variation can now be described with an HGVS Name of NM_00589.3:c.-33C>T.

E

D

F

G

H The “Marker Details” option in the marker popup menu (not shown) displays a table (H) to provide the location relative to the newly set sequence origin. This takes the manual counting out of the process.

References 1. The Graphical Sequence Viewer.ftp://ftp.ncbi.nih.gov/pub/factsheets/Factsheet_Graphical_SV.pdf 2. The New BLAST Results Page. ftp://ftp.ncbi.nlm.nih.gov/pub/factsheets/Howto_NewBLAST.pdf 3. SNP: Database for Short Genetic Variations. ftp://ftp.ncbi.nlm.nih.gov/pub/factsheets/Factsheet_SNP.pdf Contact: [email protected]

NCBI Handout Series | Finding SNPs Through BLAST | Last Update September 2, 2014

Suggest Documents