
Synonymous Permutation Reveals Selection for Less Out-of-Frame Stop Codons

Jingrui Zhong

Nanyan Zhu

Tsinghua University Tsinghua University, Haidian District Beijing, China, 100089 +86-18801355778

Sichuan University APT 5-5, Nanan District Chongqing, China, 400061 +86-18215668429

[email protected]

[email protected]

ABSTRACT One important source of premature stop codons is processivity error. Although nonsense-mediated decay (NMD) can degrade transcripts that contain a premature stop codon, it carries a fitness cost. Thus, it is commonly assumed that it is advantageous for genes to stop early after a frameshift. However, we did not identify any pattern of excess out-of-frame stop codons (OSCs) in S. cerevisiae. We shuffled the synonymous codons within genes without changing codon preference or amino acid sequence and found that fewer stop codons were selected for in the +1 reading frame, while no significant selection was detected in the -1 reading frame. Moreover, we checked where the first OSC appears and found that there is also selection against its early appearance. With the support of experimental results from another study, we therefore propose a new hypothesis for the cost of OSCs: it is more advantageous to continue translating than to truncate the peptide, because the former might still give rise to some functional products, while the latter cannot. In general, the relative fitness cost of losing possibly functional products is higher than the possible cost of degrading non-functional products.

CCS Concepts • Applied computing ➝ Life and medical sciences ➝ computational biology

Keywords synonymous codon permutation; frameshift; premature stop codon.

1. INTRODUCTION During transcription and translation, frameshifts can occur, producing functionless proteins from correct DNA information. The frequency of frameshifting errors was estimated to be approximately 10^-5 per codon [1], which is comparable to the frequency of missense errors, estimated at 4 × 10^-5 to 6.9 × 10^-4 per codon in yeast [2]. Frameshifting errors are devastating: energy is wasted on generating mRNA and protein and on degrading non-functional polypeptides. Moreover, a substantial fraction of frameshifting errors could give rise to misfolded proteins [3]. Such proteins might be toxic to the cell, for example by mis-interacting with other proteins or by aggregating with themselves. However, frameshifts sometimes have functional roles [4]. For instance, the gag ORF of the yeast L-A dsRNA virus benefits from a certain efficiency of frameshifting to produce the gag-pol fusion protein [5].

Frameshifting causes fitness defects, and organisms reduce it in multiple ways. First, coding regions tend to have fewer frameshifting mutations than non-coding regions. Microsatellite expansion in coding DNA tends to be limited to tri- and hexanucleotide repeats, because other types of microsatellite (mono-, di-, tetra-, and pentanucleotide repeats) could result in frameshifts, which are selected against [6]. Second, some genes code for proteins with N-terminal regions that are insensitive to amino acid sequence changes [7].

Frameshifts happen in multiple processes of metabolism, including DNA replication, transcription, and translation. Previous models of frameshifting in DNA replication and transcription mainly focus on the misalignment of tandem repeats [7], while frameshifting in translation is generally characterized by a “pause-and-slip” model: a translational pause followed by tRNA slippage. Here we mainly focus on frameshifts that happen in transcription and translation, without distinguishing between them, because both give rise to similar results: a portion of the products becomes functionless, but the DNA sequence is not affected.

Codon usage bias refers to differences in the frequency of occurrence of synonymous codons in coding DNA. It is well recognized that some synonymous codons are preferentially used in protein-coding sequences in many organisms, especially in fast-growing microorganisms. There are multiple ways to measure codon bias, including the Codon Adaptation Index (CAI), which describes how frequently a given gene uses the codons preferred by highly expressed genes. Some reports indicate that nucleotide and codon usage vary in gradients along genes [8], but how selection acts on the permutation, or order, of synonymous codons within a gene has long been neglected.

We focus on the order of synonymous codons and assess the extant codon order and its effects on out-of-frame stop codons. By reshuffling the synonymous codons while keeping the CAI and amino acid sequence unchanged, we obtain new sequences with a different “codon” composition after frameshifting. Out-of-frame stop codons (OSCs) occur naturally in coding sequences, contributing to nonsense-mediated decay (NMD) or other related mechanisms [9]. However, because out-of-frame RNA and protein are usually degraded quickly and are therefore hard to detect, and because OSCs can occur by chance even without selection, selection on OSCs has long been largely neglected.

In the last few years, several studies have shed light on the selection pressure on premature stop codons after frameshifting. Previous research has shown that gene sequences with high CAI appear, on average, to be more resistant to +1 frameshifting errors, and that the frameshifting robustness of gene sequences is higher than expected by chance [10]. Another large-scale study by Herman et al. showed that coding sequences with low G+C content contain significantly more OSCs and that OSCs are overrepresented compared with expected frequencies simulated using a Monte Carlo approach [11]. In this study, instead of relying on complex models, we use an assumption-free bootstrap method to permute synonymous codons and find that stop codons appear to be underrepresented in +1 frameshifted sequences in S. cerevisiae.

2. METHODS Data. The gene-coding sequences of S. cerevisiae were obtained from the Saccharomyces Genome Database (http://downloads.yeastgenome.org/sequence/S288C_reference/orf_dna/). Sequences were translated using the standard codon table.

We use the chi-square test to show the significance of differences. We assume that if there is no selection, P should follow a uniform distribution. For FSC data, we removed the 45.5-55.5% bin before performing the test.
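As an illustration of the test described above, the following is a minimal sketch of a chi-square goodness-of-fit statistic against a uniform expectation. The function name `chi_square_uniform` and the bin counts are hypothetical and are not taken from the study; the sketch only computes the statistic and the degrees of freedom, and converting them to a p value would need a chi-square distribution function (for example `scipy.stats.chisquare`, if SciPy is available).

```python
def chi_square_uniform(counts):
    """Chi-square statistic of observed bin counts against a uniform expectation.

    Returns (statistic, degrees_of_freedom). Any bin that should be excluded
    (such as the tie-dominated middle bin for FSC data) must be removed from
    `counts` by the caller before this function is called.
    """
    total = sum(counts)
    expected = total / len(counts)  # same expected count in every bin
    statistic = sum((c - expected) ** 2 / expected for c in counts)
    return statistic, len(counts) - 1


# Toy example with made-up counts for three bins (not data from the paper).
toy_counts = [10, 20, 30]
stat, dof = chi_square_uniform(toy_counts)
print(stat, dof)
```

A large statistic relative to the degrees of freedom indicates that the distribution of P departs from uniformity, which is the signature of selection that the test looks for.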

3. RESULTS OSC Number Tends to Be Lower in the -1 Reading Frame. After reshuffling 500 times, we calculated the total number of stop codons in the -1 reading frame (Figure 1). A total of 5,887 yeast ORFs were analyzed to obtain the P values. Specifically, a low P means that the actual sequence tends to have fewer stop codons than the permuted ones, and vice versa. Contrary to the prediction that OSCs are selected to be overrepresented [11], more genes had far fewer stop codons than expected. Of the 5,887 genes, 2,320 (39.4%) had a P below 10%, meaning that fewer than 10% of the permuted sequences had fewer stop codons than the actual one, while only 1,050 (17.8%) genes had a P above 50%. A power model or an exponential model (Figure 1) fits the 10-bin histogram well (R^2 = 0.9648 and R^2 = 0.9421, respectively).

Reshuffle and permutations. For every gene, 500 random permutations of synonymous codons were generated, so that the rearranged sequences have exactly the same amino acid sequence and codon usage bias but different nucleotide sequences. More specifically, we recorded the codons used for each of the 20 amino acids, as well as for the stop codons. We then reassigned each amino acid in the sequence by drawing one of its used codons without replacement. This procedure was carried out separately for each amino acid, from the start of the sequence to the end. With this design, neither the Codon Adaptation Index (CAI) nor the amino acid composition is changed in the newly generated sequence.

Frameshifting. We generated the +1 and -1 reading frames for each sequence. Of the six possible reading frames of a sequence, three lie in the opposite direction and one is the authentic reading frame, so the two remaining frames cover all possible frameshifts in the same direction. For these two alternative reading frames, “+1” means shifting downstream by one nucleotide, while “-1” means shifting upstream by one nucleotide, i.e., shifting downstream by two nucleotides.
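The reshuffling and frameshifting steps can be sketched as follows. This is an illustrative sketch rather than the authors' code: the function names (`split_codons`, `translate`, `shuffle_synonymous`, `frame_codons`) are hypothetical, the standard genetic code is assumed, and the +1 and -1 frames are modelled as skipping one and two leading nucleotides respectively, which is an assumption based on the usual convention.

```python
import random
from collections import defaultdict

# Standard genetic code in the conventional TCAG ordering ('*' marks a stop).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def split_codons(cds):
    """Split an in-frame coding sequence into codons, dropping any partial codon."""
    return [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]


def translate(cds):
    """Translate an in-frame coding sequence with the standard genetic code."""
    return "".join(CODON_TABLE[codon] for codon in split_codons(cds))


def shuffle_synonymous(cds, rng):
    """Permute synonymous codons: same protein, same codon counts, new order."""
    codons = split_codons(cds)
    pools = defaultdict(list)
    for codon in codons:
        pools[CODON_TABLE[codon]].append(codon)  # one pool per amino acid (or stop)
    for pool in pools.values():
        rng.shuffle(pool)
    # Draw without replacement from the pool of the amino acid at each position.
    return "".join(pools[CODON_TABLE[codon]].pop() for codon in codons)


def frame_codons(seq, offset):
    """Codons read after skipping `offset` nucleotides (assumed: 1 for +1, 2 for -1)."""
    end = len(seq) - (len(seq) - offset) % 3
    return [seq[i:i + 3] for i in range(offset, end, 3)]


rng = random.Random(0)            # seeded only to make the toy run repeatable
toy_cds = "ATGGCTGCCGCAAAAAAGTTATTGTAA"  # made-up sequence, not a yeast gene
shuffled = shuffle_synonymous(toy_cds, rng)
print(translate(toy_cds) == translate(shuffled))
```

Because each amino acid draws only from its own pool of observed codons, the codon counts (and therefore the CAI) are preserved by construction, while the nucleotides that straddle codon boundaries, and thus the out-of-frame codons, can change.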

Figure 1. Distribution of the P parameter that quantifies the relative number of OSCs in the -1 reading frame. The Y-axis shows the number of genes for which a fraction P of the permuted sequences has a smaller OSC number than the original sequence. The inset shows that a power model fits the 10-bin distribution well.

First stop position and stop numbers in alternative reading frame. After frameshifting each randomly shuffled sequence, we calculate the position of the First Stop Codon (FSC), represented by the length of the truncated peptide chain. This parameter marks how quickly cells can stop translating if the frameshift happens during initiation. Second, we calculate the total Number of Stop Codons (NSC), which represents how quickly cells can stop, on average, when a frameshift happens at a random position.
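The two measures can be sketched as below. The helper names are hypothetical, and two details are assumptions rather than statements from the text: the FSC is counted as the number of codons read before the first stop in the shifted frame, and a frame without any stop is assigned the full number of codons in that frame.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def frame_codons(seq, offset):
    """Codons read after skipping `offset` leading nucleotides (1 or 2)."""
    end = len(seq) - (len(seq) - offset) % 3
    return [seq[i:i + 3] for i in range(offset, end, 3)]


def first_stop_position(seq, offset):
    """FSC: number of codons translated before the first out-of-frame stop.

    Assumption: if the shifted frame contains no stop codon, the full
    number of codons in that frame is returned.
    """
    codons = frame_codons(seq, offset)
    for index, codon in enumerate(codons):
        if codon in STOP_CODONS:
            return index
    return len(codons)


def number_of_stops(seq, offset):
    """NSC: total number of stop codons in the shifted frame."""
    return sum(codon in STOP_CODONS for codon in frame_codons(seq, offset))


toy_seq = "ATGATAAGTGATTT"  # made-up sequence used only for illustration
print(first_stop_position(toy_seq, 1), number_of_stops(toy_seq, 1))
print(first_stop_position(toy_seq, 2), number_of_stops(toy_seq, 2))
```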

First Stop Codon Happens Towards the Starting Points in -1 Reading Frame. If frameshifts are likely to happen near the beginning of the -1 reading frame, the first stop codon would be expected to occur early so as to stop the frameshifted translation; however, we found no such bias towards the starting point (Figure 2). Note that there is a comparatively high peak at P = 50%. These are the cases where reshuffling fails to change the position of the first stop codon after the frameshift, for example when the first motif NNT[GA|AA|AG]N cannot be changed by reshuffling.

Statistical analysis. After bootstrapping 500 times, we compare the FSC and NSC of the simulated sequences with those of the actual gene sequence. We then calculate the fraction of simulated sequences whose value is smaller than that of the actual sequence. Specifically, if the FSCs of the two sequences are the same, we count it as one half, while if the FSC of the permuted sequence is smaller than that of the actual sequence, we count it as one. Finally, we divide the sum by 500, the total number of rearranged sequences, to obtain the P value.
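The counting rule above amounts to the following short sketch. The function name `p_parameter` is hypothetical, and applying the same half-count rule for ties to NSC as to FSC is an assumption, since the text spells out the tie rule only for FSC.

```python
def p_parameter(actual, permuted):
    """Fraction of permuted values smaller than the actual value.

    A permuted value below the actual one counts as 1, a tie counts as 1/2,
    and the sum is divided by the number of permuted sequences.
    """
    score = 0.0
    for value in permuted:
        if value < actual:
            score += 1.0
        elif value == actual:
            score += 0.5
    return score / len(permuted)


# Toy example with four permuted values (the study uses 500 per gene).
print(p_parameter(5, [3, 5, 7, 5]))
```

With this definition, a gene whose statistic is never changed by reshuffling receives P = 0.5, which is why ties pile up in the middle of the distribution.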

If there is no selection and reshuffling is effective in changing the relative position of the first stop codon, we would observe a uniform distribution of P, with a height of approximately 57.5 genes in each column. In reality, however, only 259 genes (less than 4.6%) fell in the 1-10% column, while an excess of 727 genes (12.8%) appeared in the 90-100% column (chi-square test, p = 3.58 × 10^-64).

Figure 2. Distribution of the P parameter that quantifies the relative position of the first out-of-frame stop codon in the +1 reading frame. The Y-axis shows the number of genes. P