DNA RESEARCH 4, 345-349 (1997)

Short Communication

Characterization of cDNA Clones in Size-Fractionated cDNA Libraries from Human Brain

Naohiko SEKI,* Miki OHIRA, Takahiro NAGASE, Ken-ichi ISHIKAWA, Nobuyuki MIYAJIMA, Daisuke NAKAJIMA, Nobuo NOMURA, and Osamu OHARA

Kazusa DNA Research Institute, 1532-3 Yana, Kisarazu, Chiba 292, Japan

(Received 19 September 1997)

Abstract

To evaluate the size-fractionated cDNA libraries of human brain previously constructed (O. Ohara et al. DNA Research, 4, 53-59, 1997), the occurrence of chimeric clones and the content of clones with coding potentiality were analyzed using randomly sampled clones with insert sizes of 5 to 7 kb. When the chromosomal location of 30 clones was determined by the radiation-hybrid mapping method, the map positions assigned separately from the 3'- and 5'-end sequences were coincident for 29 clones, suggesting that the occurrence of chimeric clones is at most 1/30. Using 91 clones mapped to chromosome 1, the content of clones that have the potential to code for proteins larger than 100 amino acid residues was estimated to be approximately 50% (46 out of 91 clones) on the basis of nucleotide sequence analysis and an in vitro coding potentiality assay. No significant open reading frames were detected in the remaining clones. Although clones coding for short peptides may not have been included in the above estimation, the libraries constructed from the whole brain mRNA fraction appear to contain a considerable number of clones corresponding to 5'-truncated transcripts in an unprocessed form and/or those with long 3'-untranslated regions.

Key words: cDNA library; brain; large proteins; chromosome 1; chimera; RH mapping; in vitro transcription/translation system

An important aim of the human cDNA projects is to predict the coding sequences of genes, but little effort has been put into characterizing full-length transcripts, especially the longer ones, in spite of their functional importance, mostly because of technical difficulties. Therefore, we have begun sequencing cDNA clones corresponding to relatively long transcripts.1,2 Recently, we constructed a series of size-fractionated cDNA libraries of human brain to investigate the cDNA clones coding for large proteins.3,4 According to the previous analysis of randomly sampled clones with an average insert size of 7.0 kb, approximately 90% of the clones contain new 5' sequences, and approximately 20% of these clones direct the synthesis of proteins with apparent molecular masses larger than 50 kDa in vitro.4 Although the sequencing of such clones coding for large proteins is in progress,3,8 the remaining clones have not been characterized at the sequence level. In this paper, we report the results of an analysis performed to estimate the approximate content of chimeric clones and the coding potentiality of clones, using randomly sampled clones from libraries 4 to 6 with insert sizes of 5.3 to 7.0 kb.4

Communicated by Mituru Takanami

* To whom correspondence should be addressed. Tel. +81-438-52-3932, Fax. +81-438-52-3931, E-mail: [email protected]

1. Chimeric Clone Test by 3'- and 5'-end Mapping

Among several possible ways to estimate the content of chimeric clones, we chose the chromosomal mapping method5 which uses human/rodent radiation hybrid cell panels, because the mapping data can be used in future analyses. If chromosomal mapping is performed using the sequence information on the 3'- and 5'-ends separately, the probability that a clone whose two ends map to the same position is chimeric should be very low. Based on this assumption, we randomly selected 30 clones with an average insert size of 5.3 kb, and designed polymerase chain reaction (PCR) primer sets that specifically amplify products from human genomic DNA and either the 3'- or 5'-end sequence of each cDNA clone. The radiation hybrid panels used were GeneBridge 4 (Research Genetics Inc., USA).6 Detailed experimental conditions for PCR and primer sequences will be described elsewhere (Seki et al., manuscript in preparation). The results of the analysis are shown in Table 1. The map positions deduced independently from the 3'- and 5'-end sequences were identical or extremely close on the same chromosome regions in all clones except one (hh0175), one end of which was mapped on chromosome 8 and the other on chromosome 7. Although it is difficult to draw a definite conclusion, the data suggest that the occurrence of chimeric clones in our brain cDNA libraries is at most 1/30.

Table 1. Results of chimeric clone test by 3'- and 5'-end mapping.

Clone name   Chromosomal assignment   Placement
hh0004r      Chr 10                   3.98 cR from WI-5255
hh0004f      Chr 10                   2.94 cR from WI-5255
hh0008r      Chr 1                    1.51 cR from WI-4586
hh0008f      Chr 1                    1.51 cR from WI-4586
hh0035r      Chr 3                    2.02 cR from WI-3771
hh0035f      Chr 3                    3.25 cR from WI-6691
hh0037r      Chr 15                   11.65 cR from NIB1540
hh0037f      Chr 15                   11.54 cR from NIB1540
hh0043r      Chr 1                    -0 cR from NIB1364
hh0043f      Chr 1                    -0 cR from NIB1364
hh0047r      Chr 17                   2.02 cR from WI-9178
hh0047f      Chr 17                   2.02 cR from WI-9178
hh0053r      Chr 16                   1.61 cR from D16S516
hh0053f      Chr 16                   1.61 cR from D16S516
hh0054r      Chr 7                    -0.00 cR from D7S677
hh0054f      Chr 7                    -0.00 cR from D7S677
hh0055r      Chr 16                   5.02 cR from WI-3061
hh0055f      Chr 16                   6.61 cR from WI-3061
hh0057r      Chr 14                   4.08 cR from WI-5773
hh0057f      Chr 14                   4.08 cR from WI-5773
hh0087r      Chr 1                    5.87 cR from D1S2635
hh0087f      Chr 1                    5.87 cR from D1S2635
hh0089r      Chr 9                    1.41 cR from WI-7848
hh0089f      Chr 9                    1.41 cR from WI-7848
hh0098r      Chr 7                    5.34 cR from WI-6592
hh0098f      Chr 7                    5.34 cR from WI-6592
hh0107r      Chr 9                    3.25 cR from WI-8684
hh0107f      Chr 9                    3.25 cR from WI-8684
hh0109r      Chr 17                   3.56 cR from D17S786
hh0109f      Chr 17                   1.71 cR from WI-9205
hh0110r      Chr 10                   -0.00 cR from WI-4209
hh0110f      Chr 10                   -0.00 cR from WI-4209
hh0121r      Chr 11                   0.10 cR from WI-8768
hh0121f      Chr 11                   0.10 cR from WI-8768
hh0140r      Chr 3                    3.46 cR from CHLC.GATA8B05.420
hh0140f      Chr 3                    3.36 cR from CHLC.GATA8B05.420
hh0152r      Chr 10                   5.87 cR from WI-3914
hh0152f      Chr 10                   3.87 cR from WI-3914
hh0154r      Chr 14                   6.08 cR from D14S264
hh0154f      Chr 14                   6.08 cR from D14S264
hh0157r      Chr 18                   11.88 cR from CHLC.GATA41G05
hh0157f      Chr 18                   11.77 cR from CHLC.GATA41G05
hh0166r      Chr 1                    -0.00 cR from CHLC.GATA42A04
hh0166f      Chr 1                    -0.00 cR from CHLC.GATA42A04
hh0171r      Chr 1                    1.41 cR from D1S504
hh0171f      Chr 1                    1.41 cR from D1S504
hh0175r      Chr 8                    2.22 cR from WI-4246
hh0175f      Chr 7                    1.61 cR from D7S484
hh0180r      Chr 18                   -0.00 cR from WI-3058
hh0180f      Chr 18                   -0.00 cR from WI-3058
hh0182r      Chr 14                   4.40 cR from WI-5773
hh0182f      Chr 14                   4.40 cR from WI-5773
hh0195r      Chr 2                    3.87 cR from D2S292
hh0195f      Chr 2                    3.87 cR from WI-10237
hh0197r      Chr 19                   -0.00 cR from RP_S28_1
hh0197f      Chr 19                   3.56 cR from AFMA134XB9
hh0207r      Chr 1                    -0 cR from NIB1364
hh0207f      Chr 1                    -0 cR from NIB1364
hh0208r      Chr 10                   4.60 cR from WI-6602
hh0208f      Chr 10                   4.60 cR from WI-6602

2. Sequence Analysis of Clones with Positive Signals in the in vitro Coding-capacity Assay

Chromosomal mapping was done for 1,000 clones randomly sampled from libraries 4 to 6, and 91 clones (9.1%) were assigned to chromosome 1, which represents 8.3% of the entire genome7 (Seki et al., manuscript in preparation). Using the 91 clones, we assayed their coding potentiality by the in vitro transcription/translation assay system (TNT T7-coupled reticulocyte lysate system, Promega Co., USA).4 Fifty-five clones listed in the upper section of Table 2 generated some protein signals in the visible region (larger than about 15 kDa) on SDS-polyacrylamide gel electrophoresis (PAGE). Since 6 out of the 55 clones matched known genes in the latest DNA database (GenBank database, release 93.0), we determined the entire sequence of the remaining 49 clones using the shotgun strategy previously described.4 Open reading frames (ORFs) longer than 100 amino acid residues were identified in 39 clones (KIAA0444 to KIAA0483), and their sizes were roughly coincident with those estimated for the in vitro products (see Table 2). Counting only the clones coding for proteins larger than 50 kDa, the number of clones was 28 (30%). In Fig. 1, the predicted ORFs are shown by closed boxes and Alu and other repetitive sequences by dotted and hatched boxes, respectively. The average size of the cDNA inserts was 6.13 kb, and that of ORFs corresponded to 840 amino acid residues (2.5 kb). In-frame termination codons upstream of the putative first ATG codons were identified in 19 clones (19/39), of which 10 clones carried the ATG codon within the context of Kozak's rule.9 In Fig. 1, those ATG codons are shown by closed triangles and other ATG codons which did not match Kozak's sequence are shown by open triangles. The sequence features and expression profiles of proteins larger than 50 kDa are described in the accompanying paper8 and those of smaller proteins were deposited in the public database. Among the 55 clones which showed positive signals by the in vitro assay, 10 clones did not give any ORFs longer
than 100 amino acid residues (KIAA0484 to KIAA0493). We assume that this is probably due to the false-positive detection of bands in the in vitro assay. By database search, 47 out of 55 clones were found to hit registered ESTs (see Table 2).

Table 2. Sequence data and coding potentiality of analyzed clones.

Gene     Accession   cDNA      ORF length    Apparent molecular   EST
number   number a)   length    (amino acid   mass (kDa) c)        hit d)
(KIAA)               (bp) b)   residues)
0444     AB007913    6,618     978           >100                 H
0445     AB007914    6,471     1,313         >100
0446     AB007915    6,944     251           30                   H
0447     AB007916    6,293     254           30                   H
0448     AB007917    6,632     356           40                   H
0449     AB007918    6,547     755           >100                 H
0450     AB007919    6,946     425           50                   H
0451     AB007920    6,597     245           30                   H
0452     AB007921    6,256     453           50                   H
0453     AB007922    6,375     1,052         >100                 H
0454     AB007923    6,681     882           >100                 H
0455     AB007924    6,745     499           60                   H
0456     AB007925    6,305     1,095         >100                 H
0457     AB007926    6,833     845           >100                 H
0458     AB007927    6,642     1,268         >100                 H
0459     AB007928    5,654     153           20                   H
0460     AB007929    2,935     903           >100                 H
0461     AB007930    6,148     1,355         >100                 H
0462     AB007931    7,150     2,276         >100                 H
0463     AB007932    6,263     1,963         >100                 H
0464     AB007933    6,667     329           40                   H
0465     AB007934    6,282     1,697         >100                 H
0466     AB007935    4,974     700           78                   H
0467     AB007936    6,216     2,055         >100                 H
0468     AB007937    6,400     384           71                   H
0469     AB007938    6,450     539           64                   H
0470     AB007939    6,456     1,460         >100                 H
0471     AB007940    6,834     370           50                   H
0472     AB007941    5,494     365           40                   H
0473     AB007942    5,747     913           >100                 H
0474     AB007943    5,591     245           30                   H
0475     AB007944    5,983     409           42                   H
0476     AB007945    5,525     1,386         >100                 H
0477     AB007946    5,676     1,132         >100                 H
0478     AB007947    6,149     418           56                   H
0479     AB007948    5,431     340           36                   H
0480     AB007949    6,111     1,252         >100                 H
0481     AB007950    5,351     483           63                   H
0483     AB007952    5,201     299           32                   H
0484     AB007953    6,515     NI e)         60
0485     AB007954    6,565     NI            60                   H
0486     AB007955    5,397     NI            60
0487     AB007956    6,425     NI            70                   H
0488     AB007957    6,388     NI            36
0489     AB007958    5,406     NI            45
0490     AB007959    6,075     NI            62
0491     AB007960    5,717     NI            45                   H
0492     AB007961    5,929     NI            45
0493     AB007962    5,734     NI            35
--------------------------------------------------------------------------
0494     AB007963    5,766     495           NI                   H
0495     AB007964    6,357     NI            NI
0496     AB007965    6,151     NI            NI                   H
0497     AB007966    6,474     NI            NI
0498     AB007967    6,731     NI            NI
0499     AB007968    6,453     NI            NI
0500     AB007969    6,577     NI            NI                   H
0501     AB007970    4,756     NI            NI
0502     AB007971    5,584     NI            NI
0503     AB007972    5,770     NI            NI                   H
0504     AB007973    6,397     NI            NI
0505     AB007974    5,809     NI            NI
0506     AB007975    5,951     NI            NI
0507     AB007976    5,617     NI            NI
0508     AB007977    6,013     NI            NI                   H
0509     AB007978    5,641     NI            NI
0510     AB007979    5,596     NI            NI                   H

The clones with positive signals in the in vitro assay are in the upper section and those with negative signals in the lower section.
a) Accession numbers of DDBJ, EMBL and GenBank databases.
b) Values excluding poly(A) sequences.
c) Approximate molecular masses estimated by SDS-PAGE.
d) H; Hit ESTs.
e) NI; Not identified.

3. Sequence Analysis of Clones Which Did Not Generate Protein Products in vitro

As described in the previous section, 36 out of 91 clones did not generate any protein products in the in vitro assay. To investigate the sequence features of these clones, the entire sequences of 17 clones were determined (KIAA0494 to KIAA0510). As expected, no ORFs larger than 100 amino acid residues were identified except in one clone (KIAA0494) which showed an ORF of 495 amino acid residues (see the lower section of Table 2). The ORF in this clone carries an in-frame termination codon upstream, but no sequence matching Kozak's sequence is seen, suggesting that the initiation signal in the in vitro system was not strong enough to generate an apparent product. We also noted that almost the entire regions of two clones, KIAA0484 and KIAA0486, were occupied by repetitive sequences. All of the sequenced clones except for the two clones composed of repetitive sequences were searched against the database (GenBank EST database, release August 1997) and six clones were found to hit ESTs (Table 2). We also investigated the expression of the sequenced clones in 14 different human tissues by the reverse transcription (RT)-PCR method,4 and found that all the investigated clones gave some signals (data not shown). These data suggest that most of the clones which did not generate any protein products in the in vitro assay represent at least parts of transcripts in an immature form or those with long 3'-untranslated regions.

To summarize, the percentage of clones with the potential to code for proteins larger than 100 amino acid residues is estimated to be roughly 50% (46 out of 91 clones) in the initial libraries, whereas the average length of the cDNA clones was 6.08 kb and the occurrence of chimeras is assumed to be low. Although the clones coding for short peptides may not have been included in the above estimation, the result implies that the mRNA fraction of human brain used contained a considerable amount of unprocessed transcripts or transcripts with long 3'-untranslated regions.

Acknowledgments: This project was supported by grants from the Kazusa DNA Research Institute. We thank Dr. M. Takanami for his continuous support and encouragement. Thanks are also due to Kazuko Yamada, Emiko Suzuki, Kazuhiro Sato, Tomomi Kato, Seiko Takahashi, Tomomi Tajino, Keishi Ozawa, Akiko Ukigai, and Naoko Suzuki for their excellent technical assistance.
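The ratios quoted in the text (1/30 chimeric clones, 46/91 clones with ORFs longer than 100 residues, 28/91 clones coding for proteins larger than 50 kDa, and 91/1,000 clones mapped to chromosome 1) are point estimates from modest samples. As a rough illustration of the sampling uncertainty, the sketch below attaches a Wilson score interval to each ratio. The choice of interval, the 95% level, and the names `wilson_interval`, `COUNTS` and `summarize` are assumptions made for this example; the paper itself reports only the raw ratios.

```python
from math import sqrt
from typing import Dict, Tuple


def wilson_interval(successes: int, total: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 gives ~95%)."""
    if total <= 0 or not 0 <= successes <= total:
        raise ValueError("need total > 0 and 0 <= successes <= total")
    p = successes / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half = z * sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


# (observed, sample size) pairs taken from the counts quoted in the text.
COUNTS: Dict[str, Tuple[int, int]] = {
    "chimeric clones": (1, 30),
    "ORF > 100 aa": (46, 91),
    "protein > 50 kDa": (28, 91),
    "mapped to chromosome 1": (91, 1000),
}


def summarize(counts: Dict[str, Tuple[int, int]] = COUNTS) -> Dict[str, Tuple[float, float, float]]:
    """Return {label: (point estimate, lower bound, upper bound)}."""
    rows = {}
    for label, (k, n) in counts.items():
        low, high = wilson_interval(k, n)
        rows[label] = (k / n, low, high)
    return rows


if __name__ == "__main__":
    for label, (p, low, high) in summarize().items():
        print(f"{label:24s} {p:6.1%}  (interval {low:.1%} to {high:.1%})")
```

With only 30 clones in the mapping test, the interval around 1/30 is wide, which is consistent with the cautious wording used for the chimera estimate.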


Figure 1. Physical maps of the cDNA clones analyzed. The horizontal scale represents the cDNA length in kilobases, and the gene numbers corresponding to respective cDNAs are given on the left. The ORFs and untranslated regions are shown by solid and open boxes, respectively. The positions of the first ATG codons in the cDNAs are indicated by triangles and those in the contexts of Kozak's rule are illustrated by solid triangles. Alu sequences and other repetitive sequences are represented by dotted and hatched boxes, respectively. In clones KIAA0484 and KIAA0486, almost the entire regions were occupied with Alu and L1 sequences, respectively.
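The annotations summarized in Figure 1 (ORFs longer than 100 residues, first ATG codons in or out of a Kozak context, and in-frame termination codons upstream of the ATG) can be illustrated with a small sketch. It is not the procedure used in the paper: the function names are hypothetical, only the given strand is scanned, and the "Kozak-like" test is an assumed simplification (a purine at position -3 and G at position +4 relative to the A of the ATG).

```python
STOPS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=100):
    """Return (atg_index, stop_index) for ATG-to-stop ORFs of at least
    min_codons codons, scanning the three frames of the given strand."""
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if codon == "ATG" and start is None:
                start = i  # first ATG after the previous stop in this frame
            elif codon in STOPS and start is not None:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i))
                start = None
    return orfs


def kozak_like(seq, atg):
    """Assumed rule: purine at -3 and G at +4 relative to the A of the ATG."""
    seq = seq.upper()
    if atg < 3 or atg + 3 >= len(seq):
        return False
    return seq[atg - 3] in "AG" and seq[atg + 3] == "G"


def upstream_inframe_stop(seq, atg):
    """True if a termination codon lies upstream of the ATG in the same frame."""
    seq = seq.upper()
    for i in range(atg - 3, -1, -3):
        if seq[i:i + 3] in STOPS:
            return True
    return False


def annotate(seq, min_codons=100):
    """Describe each ORF found by find_orfs with the two start-site checks."""
    return [
        {
            "atg": atg,
            "codons": (stop - atg) // 3,
            "kozak_like": kozak_like(seq, atg),
            "upstream_stop": upstream_inframe_stop(seq, atg),
        }
        for atg, stop in find_orfs(seq, min_codons)
    ]


if __name__ == "__main__":
    # Toy sequence: in-frame stop, favourable context, 121-codon ORF.
    toy = "TAAGCCACC" + "ATG" + "GCT" * 120 + "TAA"
    for row in annotate(toy):
        print(row)
```

An ORF whose first ATG lacks an upstream in-frame stop may continue beyond the 5' end of the clone, which is why the text reports that count separately.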

References

1. Nomura, N., Miyajima, N., Sazuka, T. et al. 1994, Prediction of the coding sequences of unidentified human genes. I. The coding sequences of 40 new genes (KIAA0001-KIAA0040) deduced by analysis of randomly sampled cDNA clones from human immature myeloid cell line KG-1, DNA Res., 1, 27-35.
2. Nagase, T., Seki, N., Ishikawa, K.-I. et al. 1996, Prediction of the coding sequences of unidentified human genes. VI. The coding sequence of 80 new genes (KIAA0201-KIAA0280) deduced by analysis of cDNA clones from cell line KG-1 and brain, DNA Res., 3, 321-329.
3. Nagase, T., Ishikawa, K.-I., Nakajima, D. et al. 1997, Prediction of the coding sequences of unidentified human genes. VII. The complete sequences of 100 new cDNA clones from brain which can code for large proteins in vitro, DNA Res., 4, 141-150.
4. Ohara, O., Nagase, T., Ishikawa, K.-I. et al. 1997, Construction and characterization of human brain cDNA libraries suitable for analysis of cDNA clones encoding relatively large proteins, DNA Res., 4, 53-59.
5. Schuler, G. D., Boguski, M. S., Stewart, E. A. et al. 1996, A gene map of the human genome, Science, 274, 540-546.
6. Gyapay, G., Schmitt, K., Fizames, C. et al. 1996, A radiation hybrid map of the human genome, Hum. Mol. Genet., 5, 339-346.
7. Weith, A., Brodeur, G. M., Bruns, G. A. P. 1996, Report of the second international workshop on human chromosome 1 mapping 1995, Cytogenet. Cell Genet., 73, 113-154.
8. Ishikawa, K.-I., Nagase, T., Nakajima, D. et al. 1997, Prediction of the coding sequences of unidentified human genes. VIII. 78 new cDNA clones from brain which code for large proteins in vitro, DNA Res., 4, 307-313.
9. Kozak, M. 1996, Interpreting cDNA sequences: some insights from studies on translation, Mammalian Genome, 7, 563-574.