
Very Low Gene Duplication Rate in the Yeast Genome

Li-zhi Gao and Hideki Innan

Supporting Online Material

Materials and Methods

Identifying duplicated genes

The complete set of baker's yeast (Saccharomyces cerevisiae) amino acid sequences was downloaded from SGD (http://www.yeastgenome.org/) in October 2003 and stored as a local file. We then ran BLASTP (S1) locally to compare all pairs of protein sequences in the set, retaining gene families that have two copies in the genome. Because the list of two-copy gene families (duplicated genes) obtained by this strategy depends on the definition of a gene family (i.e., the e-value cutoff of BLASTP), we used three different cutoffs (e-value = 10^-10, 10^-18, and 10^-25) and collected the pairs of duplicated genes that were detected with at least one of the three. For these pairs, we aligned the DNA sequences using CLUSTAL X (S2) with manual correction and estimated the Ks values with PAML (S3). We then removed pairs with Ks values higher than 1.05 and pairs whose gene lengths differed by more than 20%, in order to identify complete gene duplication events that might have occurred relatively recently on the S. cerevisiae lineage. In this screening process, we also removed all sequences annotated as known or suspected pseudogenes. The gene pairs (n=70) that passed this screening procedure are listed in Table S1, except for two pairs that have many copies in other species. Note that most of the gene pairs in Table S1 were identified with all three e-value cutoffs, so our results are not very sensitive to the cutoff value.
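The screening steps described above can be sketched in Python. This is a minimal illustration and not the authors' pipeline: it assumes BLASTP hits are already parsed into (gene, gene, e-value) tuples, that gene families are formed by single-linkage clustering of hits at or below a cutoff, and that the 20% length difference is measured relative to the longer gene. None of these details is stated in the text, and the gene names and function names are hypothetical.

```python
from collections import defaultdict

E_VALUE_CUTOFFS = (1e-10, 1e-18, 1e-25)  # the three cutoffs named in the text
MAX_KS = 1.05                            # pairs with Ks above this are removed
MAX_LENGTH_DIFF = 0.20                   # pairs differing by more than 20% are removed


def two_copy_families(hits, cutoff):
    """Return gene pairs that form a family of exactly two members.

    hits: iterable of (gene_a, gene_b, e_value) from an all-against-all search.
    Families are built by single linkage (an assumption of this sketch).
    """
    parent = {}

    def find(gene):
        parent.setdefault(gene, gene)
        while parent[gene] != gene:
            parent[gene] = parent[parent[gene]]  # path halving
            gene = parent[gene]
        return gene

    for gene_a, gene_b, e_value in hits:
        if gene_a != gene_b and e_value <= cutoff:
            parent[find(gene_a)] = find(gene_b)

    families = defaultdict(set)
    for gene in list(parent):
        families[find(gene)].add(gene)
    return {frozenset(members) for members in families.values() if len(members) == 2}


def candidate_pairs(hits):
    """Union of the two-copy pairs detected under at least one cutoff."""
    pairs = set()
    for cutoff in E_VALUE_CUTOFFS:
        pairs |= two_copy_families(hits, cutoff)
    return pairs


def passes_screen(ks, len_a, len_b):
    """Keep a pair if Ks <= 1.05 and the gene lengths differ by at most 20%."""
    # Difference relative to the longer gene (assumption; the text does not say).
    length_diff = abs(len_a - len_b) / max(len_a, len_b)
    return ks <= MAX_KS and length_diff <= MAX_LENGTH_DIFF


# Made-up hits: G1/G2 pair under every cutoff; G3/G4/G5 form a three-member
# family at 1e-10 but G3/G4 is a two-copy family at 1e-18.
example_hits = [
    ("G1", "G2", 1e-30),
    ("G3", "G4", 1e-20),
    ("G4", "G5", 1e-12),
]
example_pairs = candidate_pairs(example_hits)
```

In this toy input, the G3/G4 pair is recovered only under the intermediate cutoff, which illustrates why taking the union over several cutoffs can change the candidate list.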

Database for the whole genome shotgun sequences

For S. mikatae, S. kudriavzevii, S. bayanus, S. castellii, and S. kluyveri, data were collected by searching the database of Cliften et al. (S4). For S. paradoxus, data were collected by searching the database of Kellis et al. (S5). Note that the two databases contain independent genomic sequence data for S. mikatae and S. bayanus. We used the former database by default for these two species and checked the latter when necessary.

Identification of the orthologs of the duplicated genes in S. cerevisiae

For the 68 pairs of complete duplicated genes in the S. cerevisiae genome, orthologs were identified as follows. For each focal pair of duplicated genes (X and Y), the two genes adjacent to each copy (A and B for X, and C and D for Y) were determined from the NCBI Map Viewer (http://www.ncbi.nlm.nih.gov/mapview/). With these genes as query sequences, BLASTN (S1) was run against the entire genome of each of the six yeast relatives. The genomic sequences of each species cover about 90% of the whole genome in 570-2800 contigs (S4, S5). Based on the results of BLAST hits (e-value