BIOINFORMATICS

ORIGINAL PAPER

Vol. 22 no. 6 2006, pages 676–684 doi:10.1093/bioinformatics/btk032

Genome analysis

Short fuzzy tandem repeats in genomic sequences, identification, and possible role in regulation of gene expression

Valentina Boeva1,*, Mireille Regnier2, Dmitri Papatsenko3 and Vsevolod Makeev4,5

1Department of Bioengineering and Bioinformatics, Moscow State University, Moscow, Russia, 2INRIA Rocquencourt, France, 3University of California, Berkeley, USA, 4State Research Center GosNIIGenetika, Moscow, Russia and 5Engelhardt Institute of Molecular Biology, Russian Academy of Sciences, Moscow, Russia

Received on October 14, 2005; revised on December 22, 2005; accepted on December 28, 2005
Advance Access publication January 10, 2006
Associate Editor: Steven L. Salzberg

ABSTRACT

Motivation: Genomic sequences are highly redundant and contain many types of repetitive DNA. Fuzzy tandem repeats (FTRs) are of particular interest. They are found in regulatory regions of eukaryotic genes and are reported to interact with transcription factors. However, accurate assessment of FTR occurrences in different genome segments requires a specific algorithm for efficient FTR identification and classification.

Results: We have obtained formulas for P-values of FTR occurrence and developed an FTR identification algorithm implemented in the TandemSWAN software. Using TandemSWAN we compared the structure and the occurrence of FTRs with short period length (up to 24 bp) in coding and non-coding regions, including UTRs, heterochromatic, intergenic and enhancer sequences of Drosophila melanogaster and Drosophila pseudoobscura. Tandems with period three and its multiples were found in coding segments, whereas FTRs with periods that are multiples of six are overrepresented in all non-coding segments. Periods equal to 5–7 and 11–14 were characteristic of the enhancer regions and other non-coding regions close to genes.

Availability: TandemSWAN web page, stand-alone version and documentation can be found at http://bioinform.genetika.ru/projects/swan/www/

Contacts: [email protected]

Supplementary information: Supplementary data are available at Bioinformatics online.

1 INTRODUCTION

Eukaryotic genomes contain many types of repetitive sequences, such as long repeats, satellite DNA and many other yet unclassified sequences of various lengths and levels of repetitiveness (Singer and Berg, 1991). So far, research has focused predominantly on nearly perfect repeats such as microsatellites and others (Li et al., 2002). Analysis of more divergent (fuzzy) tandem repeats has been complicated by the difficulty of discriminating them from background and by the insufficient annotation of genomes. In this study we focus on fuzzy tandems containing n occurrences (n > 2) of a mismatched word with a period of T bases (T = 3–24) without insertions or deletions. Tandem repeats are usually classified into microsatellites (1–6 bp), minisatellites (6–24 bp, and in some cases longer) (Vergnaud and Denoeud, 2000) and ‘classical’ satellites. The length scale of fuzzy repeats considered here corresponds to the micro- and minisatellite repeat classes. However, we do not consider periods with T = 1 or 2, as they correspond to poly-A or TATA-like sequences, a different biological object explored elsewhere (Katti et al., 2001; Schug et al., 1998; Subramanian et al., 2003).

*To whom correspondence should be addressed.

Fuzzy tandem repeats (FTRs) have been found in regulatory regions of eukaryotic genes (Shi et al., 2000); such tandems sometimes form cooperative arrays of binding sites and interact with transcription factors (Gao and Finkelshtein, 1998; Ott and Hansen, 1996; Meloni et al., 1998; Ramchandran et al., 2000). However, it is still unclear (1) how to define and extract fuzzy tandems, (2) whether functionally different sequences are enriched in tandems of a specific structure and (3) what biological function (if any) fuzzy tandems perform in the genome. If the genome distribution of FTRs is uneven, their exploration should help to locate structural/functional sequence categories and to understand the underlying mechanisms of their function.

The degree of FTR propagation varies from one genome to another and from one functional sequence category to another; existing algorithms (Benson, 1999; Kolpakov et al., 2003) return up to 10–15% of the Drosophila melanogaster genome and >10% (Benson, 1999) of the human genome as tandem repeats of various structure. Accumulation of tandems in genomes is a result of errors during replication and some rearrangement events (Dover, 1982; Singer and Berg, 1991; Ellegren, 2004). From that perspective, much of repetitive genomic DNA might be considered non-informative; however, there are cases where the presence of tandems is tightly linked to a biological function (Nakamura et al., 1998).
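As an aside on the working definition used in this study (n > 2 copies of a mismatched word with period T = 3–24 and no insertions or deletions), the following Python sketch may help to make the object concrete. It is only an illustration and not the TandemSWAN algorithm: the function names are hypothetical, the consensus is taken by simple majority vote, and the fixed mismatch threshold `max_mismatch_fraction` is an assumption standing in for the P-value criterion that the paper derives.

```python
from collections import Counter


def split_units(seq: str, start: int, period: int, copies: int) -> list[str]:
    """Cut `copies` consecutive units of length `period` (no indels allowed)."""
    end = start + period * copies
    if start < 0 or end > len(seq):
        raise ValueError("tandem window falls outside the sequence")
    return [seq[start + k * period:start + (k + 1) * period] for k in range(copies)]


def consensus_and_mismatches(units: list[str]) -> tuple[str, int]:
    """Return the majority-vote consensus unit and the number of bases deviating from it."""
    consensus = []
    mismatches = 0
    for column in zip(*units):  # one column per position inside the unit
        base, count = Counter(column).most_common(1)[0]
        consensus.append(base)
        mismatches += len(column) - count
    return "".join(consensus), mismatches


def is_fuzzy_tandem(seq: str, start: int, period: int, copies: int,
                    max_mismatch_fraction: float = 0.2) -> bool:
    """Toy test: accept the window if few bases deviate from the consensus.

    The threshold is an illustrative assumption, not a statistical criterion.
    """
    if not 3 <= period <= 24:
        raise ValueError("period T is expected to lie in 3..24")
    if copies <= 2:
        raise ValueError("at least three copies are required (n > 2)")
    units = split_units(seq, start, period, copies)
    _, mismatches = consensus_and_mismatches(units)
    return mismatches / (period * copies) <= max_mismatch_fraction
```

For example, calling `is_fuzzy_tandem("ACGTTACGATACGTT", 0, 5, 3)` examines three 5 bp units that differ from their consensus at a single position. A realistic detector would additionally have to scan all start positions and periods and assess significance against a background model.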
For instance, long tandem repeats constitute a large portion of heterochromatin satellite DNA and are involved in centromere formation and function (Martienssen, 2003); sometimes the presence of long tandems even serves as a signal of extra centromere formation (Singer and Berg, 1991). Much less is known about the role of shorter repetitive sequences, especially highly mismatched fuzzy tandems (FTRs), which are quite abundant in exons, introns and transcription regulatory sequences (Nakamura et al., 1998). In exons, FTRs may reflect sequence periodicities existing in the protein sequence or even structural features, such as hydrophobic helices (Katti et al., 2000; Li et al., 2004); it is unclear if these tandems have any function at the DNA level.

© The Author 2006. Published by Oxford University Press. All rights reserved. For Permissions, please email: [email protected]

In complex eukaryotic regulatory regions, such as enhancers and silencers, FTRs appear to be linked with some types of binding sites for transcription factors (Antoniewski et al., 1996; Ott and Hansen, 1996; Ramchandran et al., 2000). One of the attractive models suggests that an FTR with a unit consensus similar to a binding site modulates the exact response to regulator concentration (Carroll et al., 2001; Davidson et al., 2000). Repeats of various types may also be important for regulation that controls spatial packaging/dynamics of eukaryotic DNA. Thus, 8–16 bp repeats separated by distance