
Article SPECIAL TOPIC Bioinformatics

August 2010 Vol.55 No.22: 2359–2365 doi: 10.1007/s11434-010-3027-5

Cis-regulatory change and expression divergence between duplicate genes formed by genome duplication of Arabidopsis thaliana

CHEN KeNian1,2, ZHANG YanBin3, TANG Tian1 & SHI SuHua1*

1 State Key Laboratory of Biocontrol, School of Life Sciences, Sun Yat-Sen (Zhongshan) University, Guangzhou 510275, China;
2 Department of Biotechnology, Guangzhou Medical College, Guangzhou 510182, China;
3 College of Computer Science, Hangzhou Dianzi University, Hangzhou 310018, China

Received April 14, 2009; accepted August 13, 2009

As an index of functional divergence, expression divergence between duplicate gene copies has been observed and correlated with protein-coding sequence divergence and with bias in gene functional classes. However, changes in the cis-regulatory regions of duplicate genes, which are thought to play an important role in expression divergence, have not been explored on a genome-wide scale. We analyzed functional genomics data for a large number of duplicated gene pairs formed by ancient polyploidy events in Arabidopsis thaliana. The divergence in cis-regulatory regions between the two copies is positively correlated with the difference in expression magnitude. Moreover, we find that highly expressed duplicate gene pairs have more diverged cis-regulatory regions than weakly expressed gene pairs. We also show that the correlation between expression functional constraint and protein functional constraint differs between old and young duplicate pairs. Our results suggest that cis-regulatory sequence divergence contributes to the expression divergence of duplicate genes formed by genome-wide duplication, and that cis-regulatory regions diverge faster in highly expressed duplicate pairs. The strengths of diversifying selection acting on the cis-regulatory region and on the protein-coding region are negatively correlated in young duplicate pairs under expression constraint.

genome duplication, duplicate gene, cis-regulatory divergence, expression divergence, Arabidopsis, evolution

Citation: Chen K N, Zhang Y B, Tang T, et al. Cis-regulatory change and expression divergence between duplicate genes formed by genome duplication of Arabidopsis thaliana. Chinese Sci Bull, 2010, 55: 2359–2365, doi: 10.1007/s11434-010-3027-5

*Corresponding author (email: [email protected])

© Science China Press and Springer-Verlag Berlin Heidelberg 2010 csb.scichina.com www.springerlink.com

Since Ohno [1], “evolution by gene duplication” has been recognized as a general principle of biological evolution [2]. The most spectacular form of gene duplication is whole-genome duplication via polyploidization, which is most prominent in plants [3–5]. The model plant Arabidopsis is believed to have experienced at least three ancient polyploidy events [6–8]. The remnants of these polyploidy events comprise ~23% of the genes in the current Arabidopsis genome and form a large set of duplicated chromosomal segments, which have been identified and ordered into different age classes by several groups using slightly different approaches [6,9]. Previous studies indicated that functional divergence is the most likely fate of duplicate genes retained in the genome [4,10–12]. However, the driving force of this divergence is still poorly understood [13]. Although there are good reasons to consider the cis-regulatory regions of duplicate genes as well, almost all studies to date have focused on protein-coding sequences [2,4,12,14–18], mainly because of the lack of a biologically relevant measure of cis-regulatory evolution that relates directly to gene expression [3,4,15,19]. To address this gap, Castillo-Davis et al. [20] described the shared motif method (SMM), which quantifies functional regulatory changes in cis-regulatory regions. In this paper, we adopted the SMM to investigate how changes in cis-regulatory regions contribute to expression divergence between paralogs in Arabidopsis, the effect of genome duplication on cis-regulatory region evolution,


and the relationship between protein and regulatory sequence evolution in duplicates. Duplicate genes that derive from polyploidy events and still lie in duplicated chromosomal segments are excellent material for addressing these questions, for several reasons. Firstly, genes duplicated via retrotransposition lose their regulatory sequences and gain additional sequences in the flanking regions, and tandem duplication by unequal crossing over may not include the entire coding and/or regulatory sequence [2,15,21]. Duplicate genes created by these mechanisms therefore differ immediately after duplication, which complicates the analysis of cis-regulatory region divergence. Secondly, duplicates derived from genome duplication were created simultaneously, so the divergence time is the same for all pairs belonging to the same age class [7,9,10]. Thirdly, segmental duplicate genes have been identified extensively by several groups with similar results, so the data are reliable, and they are less affected by genome rearrangement than other types of duplicate genes [4,8].

1 Materials and methods

(i) Sequence data. All gene and genomic sequence information, including intergenic distances, protein coding sequences (CDS), and sequences upstream of transcription start sites, was obtained from TAIR database release 8. For genes with more than one transcript (annotated splice variants), only the first instance (ID with the “.1” suffix), which is usually the longest transcript of the gene, was used.

(ii) Expression data. Gene expression information was obtained from the Nottingham Arabidopsis Stock Centre’s microarray database (NASCArrays). The dataset, which comprised expression intensities from 62 ATH1 Affymetrix Arabidopsis microarrays covering various experimental conditions and tissues, was the one used by Blanc et al. [4]. Microarray probe intensities were normalized using the MAS5.0 algorithm, i.e. the top 2% and bottom 2% of signal intensities were excluded and the mean of the remainder was calculated; the original signal values were then scaled so that this mean equaled 100. Expression values were averaged among replicates. 128 genes with the potential for cross-hybridization (marked with the “x” suffix on their probe ID) were discarded. We also excluded any probes that matched multiple genes. Thus, the potential for cross-hybridization was largely reduced. Genes without at least one expression value >150 were classified as ‘not expressed’, meaning that they were at most weakly expressed; the remaining genes were classified as ‘expressed’. The expression magnitude difference was defined as the absolute difference between the maximum expression values of the two copies.

(iii) Protein sequence analysis. To identify dispersed pairs, the FASTA method described by Gu et al. [22] was used. Briefly, after excluding mitochondrial and chloroplast proteins, an all-against-all FASTA search was conducted with E80% of the longer protein, and (2) the identity between two proteins was ≥ 30% for alignments longer than 150 amino acids or ≥ $0.01n + 4.8L^{-0.32[1+\exp(-L/1000)]}$ otherwise, where L is the alignable length between two proteins and n = 6. Tandem duplicates were identified as duplicate genes located within 100 kb of each other and separated by fewer than 2 non-homologous genes. Dispersed pairs were selected from families containing only two members, excluding segmental and tandem duplicates. Duplicate gene pairs derived from genome duplication were retrieved from Blanc et al. [4]. The dataset contained 2584 ‘young’ duplicate pairs from the most recent polyploidy events, and 1372 ‘old’ duplicate pairs formed in two older polyploidy events. For each duplicate pair, coding sequences were aligned with CLUSTALW [23] using the amino-acid translation of each sequence, followed by back-translation into a DNA sequence alignment. Maximum likelihood estimates of Ka, Ks and the Ka/Ks ratio were obtained using the CODEML program [24] in the PAML package [25].

(iv) Regulatory sequence analysis. The cis-regulatory sequence analysis was performed using the SMM (shared motif method), which is described in detail by Castillo-Davis et al. [20]. Briefly, a shared motif is defined as a region of high local similarity between two given DNA sequences, without considering order, orientation, or spacing. The SMM value was defined as the fraction of both sequences containing shared motifs. The SMM software (sharmot) was obtained from Castillo-Davis et al. and was used to calculate the divergence between each duplicate pair for upstream sequences of 100, 500, 1000 and 1500 bp from the transcription start site (TSS). To obtain the distribution of SMM values over the 3000 bp upstream of the TSS, we modified the SMM and created a sliding-window version of it. The window size was 300 bp and the step size was 100 bp, and the window was slid over the 3000 bp upstream of the TSS. For each group of pairs (i.e. Dispersed-E, Dispersed-N, Old-E, Young-E, Old-N and Young-N pairs), the average SMM value and standard error of each window were calculated and used to draw the distribution line. Correlation and multiple linear regression analyses were performed using the R statistical package.
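The sliding-window layout described above (300 bp windows, 100 bp steps, 3000 bp upstream of the TSS) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' modified SMM: the names `sliding_windows`, `window_profile` and `toy_score` are hypothetical, index 0 of each sequence is assumed to be the base adjacent to the TSS, each window is assumed to be scored against the window at the same position in the other copy, and `toy_score` is a trivial stand-in for the sharmot software, whose interface is not given in the text.

```python
from math import sqrt
from statistics import mean, stdev


def sliding_windows(total_len=3000, window=300, step=100):
    """Return (start, end) offsets, 0-based and end-exclusive.

    Assumption: offset 0 is the base adjacent to the TSS.
    """
    return [(s, s + window) for s in range(0, total_len - window + 1, step)]


def toy_score(x, y):
    """Placeholder only: fraction of identical positions. This is NOT the SMM."""
    return sum(c1 == c2 for c1, c2 in zip(x, y)) / max(len(x), 1)


def window_profile(pairs, score=toy_score, total_len=3000, window=300, step=100):
    """Average a pairwise score over all duplicate pairs, window by window.

    pairs: list of (upstream_seq_copy1, upstream_seq_copy2) strings.
    Returns a list of (start, end, mean_score, standard_error) tuples.
    """
    profile = []
    for s, e in sliding_windows(total_len, window, step):
        values = [score(a[s:e], b[s:e]) for a, b in pairs]
        # Standard error of the mean; undefined for a single pair, so use 0.0.
        se = stdev(values) / sqrt(len(values)) if len(values) > 1 else 0.0
        profile.append((s, e, mean(values), se))
    return profile


if __name__ == "__main__":
    # Two made-up pairs: one identical, one completely different.
    demo_pairs = [("A" * 3000, "A" * 3000), ("A" * 3000, "C" * 3000)]
    for row in window_profile(demo_pairs)[:3]:
        print(row)
```

With these parameters the upstream region is covered by 28 overlapping windows, from 0–300 bp to 2700–3000 bp. In practice `score` would be replaced by a wrapper around the actual SMM computation for each group of pairs.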

2 Results and discussion

2.1 Identification and classification of duplicate genes

A FASTA method described by Gu et al. [22] was used to identify duplicate pairs. As mentioned earlier, we mainly focused on duplicate genes formed by genome duplication when analyzing the correlation between cis-regulatory divergence and expression divergence. We also utilized the gene families containing only two members when studying


the interplay between expression constraint and cis-regulatory divergence, because the influence of other family members is smaller in such families. A total of 1148 dispersed duplicate pairs (i.e. neither tandem nor segmental duplicates) were identified. Genome duplication in Arabidopsis has been extensively studied, and polyploidy-derived duplicated gene pairs that still lie on duplicated segments have been identified by several groups using slightly different approaches [6–9]. Because most of their data overlap, the slight differences should not significantly affect the investigation. Therefore, we performed the analysis using only the dataset from Blanc et al. [4], which they had used in their own study. In this dataset, 2584 ‘young’ duplicate pairs came from the most recent polyploidy events, and 1372 ‘old’ duplicate pairs formed in two older polyploidy events. We first classified all genes as ‘not expressed’ or ‘expressed’ according to their expression values across the 62 microarray experiments (see Materials and methods for details). In brief, genes with an expression value >150 in at least one experiment were classified as ‘expressed’, and all others as ‘not expressed’. Genes subject to cross-hybridization were excluded, and only genes for which a unique probe set (probe ID with the ‘_at’ extension, without suffix) was available on the ATH1 microarray were retained, which minimized the effect of potential cross-hybridization. Among the 1148 dispersed duplicate pairs, both copies were ‘expressed’ in 595 pairs (Dispersed-E) and either or both copies were ‘not expressed’ in 553 pairs (Dispersed-N). Among the 2584 young duplicate pairs, both copies were ‘expressed’ in 1125 pairs (Young-E pairs) and either or both copies were ‘not expressed’ in 1459 pairs (Young-N pairs). Among the old duplicate pairs, both copies were ‘expressed’ in 420 pairs (Old-E pairs) and either or both copies were ‘not expressed’ in 728 pairs (Old-N pairs) (Table 1).
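The classification rule used here, together with the expression magnitude difference defined in Materials and methods, is simple enough to sketch in Python. The function names and the example intensities below are hypothetical and serve only to make the rule concrete; the only values taken from the text are the >150 cutoff and the definition of the magnitude difference as the absolute difference between the two maximum expression values.

```python
THRESHOLD = 150  # expression cutoff stated in the text


def is_expressed(values, threshold=THRESHOLD):
    """A gene is 'expressed' if at least one experiment exceeds the cutoff."""
    return max(values) > threshold


def classify_pair(values_a, values_b, threshold=THRESHOLD):
    """Return 'E' if both copies are expressed, otherwise 'N'."""
    both = is_expressed(values_a, threshold) and is_expressed(values_b, threshold)
    return "E" if both else "N"


def magnitude_difference(values_a, values_b):
    """Absolute difference between the maximum expression values of two copies."""
    return abs(max(values_a) - max(values_b))


if __name__ == "__main__":
    # Made-up scaled intensities for two copies across three experiments.
    copy_a = [10.0, 200.0, 95.0]
    copy_b = [160.0, 5.0, 40.0]
    print(classify_pair(copy_a, copy_b))         # both exceed 150 somewhere
    print(magnitude_difference(copy_a, copy_b))  # |200 - 160|
```

Note that the comparison is strict, so a gene whose highest value is exactly 150 counts as ‘not expressed’ under this reading of the rule.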
The correlation analysis of cis-regulatory sequence divergence and expression difference was carried out using the Young-E and Old-E pairs, in which both copies were ‘expressed’.

2.2 Cis-regulatory sequence divergence and expression magnitude changes

Expression profile correlation is often used as an amenable


indicator of functional divergence between duplicate genes. However, gene expression has many functional aspects. For example, gene expression can change in spatial, temporal, and environmental dimensions, as well as in expression level (i.e. transcript abundance). The former changes were referred to as changes in relative expression and the latter as changes in expression magnitude in [20]. It is impossible to summarize all of them with a single measure. In analyzing expression patterns and cis-regulatory region divergence in C. elegans, Castillo-Davis et al. used a ‘shared motif method’ (SMM, see Materials and methods for details) and observed a correlation between cis-regulatory sequence divergence and the difference in expression magnitude between duplicate genes [20]. We adopted the SMM to measure divergence between the upstream sequences of each duplicate pair in which both copies were ‘expressed’, and related it to the difference in maximum expression value. Note that our purpose is not to identify high-confidence cis-regulatory motifs but rather to detect global quantitative trends. Because the average size of regulatory regions is not known, we first calculated the average intergenic sequence length in the Arabidopsis genome, which was about 1600 bp, and then looked for shared motifs within 100, 500, 1000 and 1500 bp upstream of annotated transcription start sites. For the 420 Old-E pairs, we observed a highly significant correlation between dsm (i.e. 1 − proportion of shared motifs in total sequence) and the difference in gene expression magnitude (Spearman correlation rs = 0.169, P
