ShiftDetector: detection of shift mutations

BIOINFORMATICS APPLICATIONS NOTE

Vol. 18 no. 8 2002 Pages 1137–1138

Eyal Seroussi 1,∗, Micha Ron 1 and Darek Kedra 2

1 Institute of Animal Science, Volcani Center, Bet-Dagan 50250, Israel and 2 Information Systems Department, The Burnham Institute, La Jolla CA 92037, USA

Received on October 21, 2001; revised on February 3, 2002; accepted on March 4, 2002

ABSTRACT
Summary: Sequencing of a bi-allelic PCR product that contains an allele with a deletion/insertion mutation results in a superimposed trace file following the site of this shift mutation. A trace file of this type hampers the use of current computer programs for base calling. ShiftDetector analyses a sequencing trace file to discover whether it is a superimposed sequence of two molecules that differ by a shift mutation of 1 to 25 bases. The program calculates a probability score for the existence of such a shift and reconstructs the sequence of the original molecule.
Availability: ShiftDetector is available from http://cowry.agri.huji.ac.il
Contact: [email protected]

∗ To whom correspondence should be addressed.

© Oxford University Press 2002

Deletion/insertion mutations are frequently the cause of genetic disease; e.g. 70% of the mutations in cystic fibrosis patients correspond to a three-base-pair deletion (Kerem et al., 1989). Direct sequencing of PCR-amplified DNA (Du et al., 1993) is a method of choice for detecting mutations in the search for disease genes. However, sequencing of PCR products that are heterozygous for deletion/insertion mutations results in trace files that cannot be interpreted by current tools for sequence analysis. Following the site of the mutation, base calling is hampered by ambiguity that arises from two different alleles that are shifted in their positions (e.g. Figure 1A). Such data are often discarded as a bad trace file, although they can be interpreted as a superimposed sequence of the two alleles (Figure 1B and C).

We have developed a computer program that scans a trace file for such a shift and predicts the sequence of the DNA molecule following the site of the shift. The program is a Perl script that uses the Phred -d option for base calling (Ewing et al., 1998). Each call returns the identity of the two major peaks and their quality. If only one peak is detected, the program assumes that this peak results from two identical major peaks. The program is capable of detecting shift mutations of 1 to 25 bases. For each of these possible shift sizes, the program tries to identify the nucleotide of the original molecule that should repeat itself after the exact number of bases that corresponds to the size of the shift. If it detects such an identity, it stores the base in the predicted molecule sequence. In cases of ambiguity (e.g. a stretch of double peaks of G and C can result from superimposed cccc... with gggg..., cgcg... with gcgc... or ccgg... with ggcc...), the program marks the ambiguity but still indicates the most probable base, considering its quality and its position.

To find the site of a possible shift, the program calculates a probability score for each base, based on a search for similarity of 10 consecutive bases to the bases that follow any putative shift. The user may control the stringency of this search by indicating the region of search and an expected value. The useful range of this value is between 1 × 10⁻⁵ (in the most stringent instances) and 1 × 10⁻². Regions that interfere with the analysis should be excluded: those that contain tandem repeats, and the tail of ‘noise’ sequence that follows the end of sequencing of short PCR products. The program reports cases of shift mutation and predicts the sequence starting from the base with the highest statistical significance for initiating the shift. Output is also given in EXP format, which allows tagging of the ambiguous bases and incorporation into a GAP4 database (Staden et al., 2000).

We analyzed 22 trace files with different shift mutations to evaluate the performance of ShiftDetector. In all cases, the program detected the shift mutation when the proper search parameters were applied, as exemplified in http://cowry.agri.huji.ac.il/ DATA SET/dataset.html. At low stringency the program will often over-predict the occurrence of shift mutations. Nevertheless, such predictions can easily be discarded by examining the predicted sequence with the Basic Local Alignment Search Tool (BLAST).

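The text does not give the formula behind the probability score, so the following Python sketch only illustrates one plausible way to turn a 10-base window comparison into a score that can be compared with a user-supplied expected value. It assumes a binomial tail with a per-position chance agreement of 0.25; that model, the parameter `p_chance`, and the function names `binomial_tail`, `window_score` and `best_site` are assumptions, not the published method.

```python
from math import comb


def binomial_tail(matches, window=10, p_chance=0.25):
    """Probability of at least `matches` chance agreements in `window`
    positions under an assumed binomial model."""
    return sum(
        comb(window, j) * p_chance**j * (1 - p_chance) ** (window - j)
        for j in range(matches, window + 1)
    )


def window_score(peaks, site, shift, window=10, p_chance=0.25):
    """Score a putative shift starting at `site`: count positions in the
    window whose peaks share a base with the position `shift` bases later."""
    matches = sum(
        1
        for i in range(site, site + window)
        if set(peaks[i]) & set(peaks[i + shift])
    )
    return binomial_tail(matches, window, p_chance)


def best_site(peaks, shift, expect=1e-3, window=10):
    """Return (site, score) for the most significant site, or None when no
    site scores at or below the expected value `expect`."""
    last = len(peaks) - shift - window
    scored = [(window_score(peaks, s, shift, window), s) for s in range(last + 1)]
    if not scored:
        return None
    score, site = min(scored)  # lowest score wins; ties go to the earliest site
    return (site, score) if score <= expect else None


# Toy input: eight single-peak positions, then twelve G/C double peaks.
toy_peaks = list("ACGTACGT") + ["CG"] * 12
print(best_site(toy_peaks, shift=1))
```

Under this assumed model, ten agreements out of ten give a score of about 1 × 10⁻⁶ and seven out of ten give about 4 × 10⁻³, which happens to bracket the useful range of expected values quoted in the text. The toy input also shows why tandem repeats should be excluded: a repetitive stretch agrees with itself at many shift sizes.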
Fig. 1. Effect of two-base insert on base calling. Direct sequencing of bi-allelic PCR product (a) is compared with the sequence of the two alleles following subcloning (b) and (c). Note the deletion of two nucleotides T and A, at positions 538 and 539, which are present in (b) and not in (c). The shaded region of (a) marks the superimposed part following this shift which hampers base calling. Traces were visualized using GAP4 program (Staden et al., 2000).

REFERENCES
Ewing,B., Hillier,L., Wendl,M.C. and Green,P. (1998) Base-calling of automated sequencer traces using Phred. Genome Res., 8, 175–185.
Du,Z., Hood,L. and Wilson,R.K. (1993) Automated fluorescent DNA sequencing of polymerase chain reaction products. Meth. Enzymol., 218, 104–121.
Kerem,B., Rommens,J.M., Buchanan,J.A., Markiewicz,D., Cox,T.K., Chakravarti,A., Buchwald,M. and Tsui,L.C. (1989) Identification of the cystic fibrosis gene: genetic analysis. Science, 245, 1073–1080.
Staden,R., Beal,K.F. and Bonfield,J.K. (2000) The Staden package, 1998. Methods Mol. Biol., 132, 115–130.
