Genome Informatics 14: 452–453 (2003)
Detection of Processed Pseudogenes Based on cDNA Mapping to the Human Genome

Hiroaki Sakai1 [email protected]
Kanako O. Koyanagi2 [email protected]
Takeshi Itoh1 [email protected]
Tadashi Imanishi1 [email protected]
Takashi Gojobori3 [email protected]

1 Japan Biological Information Research Center, Time24 Bldg. 10F, 2-45 Aomi, Koto-ku, Tokyo 135-0064, Japan
2 Graduate School of Information Science, Nara Institute of Science and Technology, 8916-5 Takayama-cho, Ikoma-shi, Nara 630-0101, Japan
3 National Institute of Genetics, 1111 Yata, Mishima-shi, Shizuoka 411-8540, Japan
Keywords: cDNA mapping, processed pseudogene
1 Introduction
Processed pseudogenes arise when messenger RNAs are reverse-transcribed and reintegrated into genomic DNA, after which the copies degrade through disablements (premature stop codons and frameshifts). Estimates of the pseudogene population in the human genome range from about 9,000 [2] to 33,000 [1], and the exact number remains unclear. Processed pseudogenes have mainly been detected by combining sequence similarity searches against known gene collections with the ratio of non-synonymous to synonymous substitutions (KA/KS) [3, 4]. Polyadenylation sites and signals in the genome sequence are also useful for detecting processed pseudogenes [2]. However, these methods may miss some processed pseudogenes, such as splice variants and intron-containing pseudogenes. We therefore attempted to detect processed pseudogenes based on cDNA mapping and sequence analysis of their 3’ terminal signals.
2 Methods
2.1 cDNA Mapping
The cDNA collection used for the analysis consisted of 113,196 human cDNA sequences deposited in DDBJ, EMBL, and GenBank. Genomic fragments and non-coding RNA sequences were excluded. We conducted a BLASTN search of all the cDNA sequences against the human genome sequence (NCBI build 34) and extracted the corresponding genomic regions for each query sequence. We then used est2genome to align each cDNA sequence to its genomic region, with thresholds of ≥95% identity and ≥90% coverage. When a cDNA sequence mapped to multiple positions on the human genome, we selected its best locus based on the identity, length coverage, and number of exons of the alignments.
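The filtering and best-locus step can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Mapping` record, the function names, and in particular the ranking rule (identity first, then coverage, then exon count) are assumptions made for illustration, since the text names the three criteria without stating how they are combined. The 95% and 90% thresholds are treated here as inclusive lower bounds, which is also an assumption.

```python
from dataclasses import dataclass

MIN_IDENTITY = 95.0  # percent identity threshold from the text
MIN_COVERAGE = 90.0  # percent coverage threshold from the text


@dataclass(frozen=True)
class Mapping:
    """One cDNA-to-genome alignment (hypothetical record layout)."""
    cdna_id: str
    chrom: str
    start: int
    identity: float  # percent identity of the alignment
    coverage: float  # percent of the cDNA length covered
    n_exons: int     # number of exons in the alignment


def passes_threshold(m):
    # Assumption: both thresholds are inclusive lower bounds.
    return m.identity >= MIN_IDENTITY and m.coverage >= MIN_COVERAGE


def split_best_and_others(mappings):
    """Return (best_locus, remaining_loci) for one cDNA.

    Assumption: loci are ranked by identity, then coverage, then exon
    count, all descending. The source does not specify this ordering.
    """
    kept = [m for m in mappings if passes_threshold(m)]
    if not kept:
        return None, []
    ranked = sorted(
        kept,
        key=lambda m: (m.identity, m.coverage, m.n_exons),
        reverse=True,
    )
    return ranked[0], ranked[1:]
```

The loci returned in the second element are the ones that were mapped but not chosen as the best locus for that cDNA.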
2.2 Pseudogene Detection
A candidate pseudogene was defined as a locus that was not selected as the best locus by our cDNA mapping procedure. Fig. 1 shows the flow chart of processed pseudogene detection used in our analysis. All the candidate sequences were classified into six classes according to the way in which the introns are processed. Gaps of less than 80bp were not considered to be introns (a cutoff chosen by inspecting the distribution of intron lengths for all the mapped cDNA sequences). We then searched for the polyadenylation signal and polyadenylation site in the 3’ region of each candidate sequence. When a polyadenylation signal was found within 50bp upstream of the candidate polyadenylation site, the sequence was labeled “1”; when the signal was found between 51 and 110bp upstream, it was labeled “2”. All other sequences with a detected candidate polyadenylation site were labeled “3”.
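The two numeric rules in this section, the 80bp intron cutoff and the polyadenylation labels, can be sketched in Python as follows. The function names and input conventions are hypothetical. The sketch assumes that signal distances are already measured in bp upstream of the candidate polyadenylation site, and that label “1” takes precedence when signals fall in both windows; the text does not state either point explicitly.

```python
MIN_INTRON_GAP = 80  # bp; shorter alignment gaps are not counted as introns


def count_introns(gap_lengths):
    """Count alignment gaps long enough to be treated as introns."""
    return sum(1 for gap in gap_lengths if gap >= MIN_INTRON_GAP)


def polya_label(signal_distances, site_found=True):
    """Label a candidate by the position of its polyadenylation signal.

    signal_distances: distances (bp) of each detected signal upstream of
    the candidate polyadenylation site (assumed input convention).
    Returns "1", "2", or "3", or None when no candidate site was detected.
    """
    if not site_found:
        return None
    # Assumption: the nearer window wins if signals occur in both.
    if any(0 <= d <= 50 for d in signal_distances):
        return "1"
    if any(51 <= d <= 110 for d in signal_distances):
        return "2"
    return "3"
```

For example, under these assumptions a candidate whose only signal lies 75bp upstream of the site would receive label “2”, and a candidate with a site but no signal within 110bp would receive label “3”.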
Fig. 1: Flow chart of processed pseudogene detection.