Growth and gaps in plant genome size data
SUMMARY
‘Complete’ genome sequencing in angiosperms is still unable to generate C-values more accurate than those made by non-molecular methods.
Introduction
‘Complete’ genome sequencing and genome size
Such measurements yield exciting new records, such as the tiny genome size estimate of c. 63 Mb for Genlisea aurea [4], a new minimum for angiosperms and only 40% of that of Arabidopsis thaliana (157 Mb), and 149,000 Mb for Paris japonica [5], a new maximum for eukaryotes, which greatly extends the known range of genome sizes in plants to nearly 2,400-fold.
When the first ‘complete’ genome sequence for a plant was announced in 2000 for Arabidopsis thaliana [9], it was hoped this might herald a new era in genome size studies, providing the first benchmark genome size for a plant based on the total number of base pairs comprising the entire C-value. Indeed, the publication of a truly complete genome sequence for Caenorhabditis elegans in 1998 suggested that such a goal might be feasible.
Genlisea aurea: 1C = 63.7 Mb
Paris japonica: 1C = c. 149,000 Mb
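The "nearly 2,400-fold" range and the comparison with Arabidopsis thaliana follow directly from the quoted 1C values. A minimal sketch of that arithmetic, assuming the figure-label value of 63.7 Mb for Genlisea aurea (the constant and function names are illustrative only, not part of the database):

```python
# 1C values in Mb, as quoted in the text and figure labels.
GENLISEA_AUREA_MB = 63.7      # smallest reported angiosperm genome
PARIS_JAPONICA_MB = 149_000   # largest reported eukaryote genome
ARABIDOPSIS_MB = 157          # flow cytometry estimate for A. thaliana


def fold_range(smallest_mb, largest_mb):
    """Return how many times larger the largest genome is than the smallest."""
    return largest_mb / smallest_mb


fold = fold_range(GENLISEA_AUREA_MB, PARIS_JAPONICA_MB)
share_of_arabidopsis = GENLISEA_AUREA_MB / ARABIDOPSIS_MB

print(f"Known range of genome sizes: about {fold:.0f}-fold")
print(f"G. aurea as a share of A. thaliana: {share_of_arabidopsis:.1%}")
```

Using the rounded value of c. 63 Mb instead gives a slightly larger fold range, which is why the text hedges with "nearly".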
To make C-value data easily available for reference and analysis in a user-friendly form, lists of DNA amounts for flowering plant species, including both published data and values notified as personal communications, have been compiled and published in hard copy since 1976, and electronically (from 1997) in the Angiosperm DNA C-values database (since 2001 a subset of the Plant DNA C-values database).
The total DNA amount in the unreplicated haploid or gametic nucleus of an organism is referred to as its C-value (or genome size). Genome size is highly variable between plant species, and despite some intraspecific variation tends to be highly characteristic for taxa. Genome size is a key biodiversity character, of fundamental significance with many practical and predictive uses, whose study is an important element in holistic genomics.
These lists and databases have been widely used for comparative studies, and are the main source of information on plant genome size cited in the literature. Published lists of angiosperm DNA amounts have been cited over 2,700 times, whilst the Plant DNA C-values database has received over 340,000 hits (over two per hour for angiosperms) since 2001.
Published information on DNA amounts is widely scattered in diverse journals, whilst a significant amount is unpublished and unavailable. Thus it can be difficult to know whether a measurement exists for a taxon, and if so, where to find it.
The Plant DNA C-values database (release 6.0, December 2012)
http://data.kew.org/cvalues/
The Plant DNA C-values database provides a one-stop, user-friendly, searchable electronic database where genome size estimates for plant species can be readily accessed and compared. The data are compiled from more than 800 original sources.
Plant DNA C-values database (release 6.0, December 2012). MD Bennett and IJ Leitch. News: release of new data for over 1400 angiosperm species.
The DNA amount in the unreplicated gametic nucleus of an organism is referred to as its C-value, irrespective of the ploidy level of the taxon. The Plant DNA C-values database currently contains data for 8,510 different plant species. It combines data from the Angiosperm DNA C-values database (release 8.0, Dec. 2012), the Gymnosperm DNA C-values database (release 5.0, Dec. 2012), the Pteridophyte DNA C-values database (release 5.0, Dec. 2012) and the Bryophyte DNA C-values database (release 3.0, Dec. 2010), together with the Algae DNA C-values database (release 1.0, Dec. 2004).
20 µm
Chromosomes of Genlisea aurea (2n = c. 52) and Paris japonica (2n = 40) shown at the same magnification.
Despite their record genome sizes neither is likely to be a new model species. Genlisea aurea (the corkscrew plant) is a small rootless carnivorous species from Central Africa which traps protozoa in its modified leaves. Sequencing its tiny genome is hampered by a shortage of its DNA, free from protozoan DNA. Paris japonica (Japanese canopy plant) is a very slow growing species, endemic to Honshu Island in Japan, whose huge octoploid genome prevents complete genome sequencing.
Tracking workshop targets
Workshops on plant genome size sponsored by Annals of Botany were held at Kew (1997, 2003) and Vienna (2005). Their key aims were to:
• Identify major gaps in our knowledge of plant DNA amounts.
• Agree targets and priorities for new work to fill them by informal international collaboration.
• Monitor progress in filling them to improve representation, accuracy, and availability of genome size data.
In 2003 the 2nd Plant Genome Size Workshop set three key systematic 5-year targets for angiosperms, which were updated in 2011 [6]:
(i) Estimate first C-values for the next 1% of angiosperms (i.e. 2% representation), and within this
(ii) Achieve 75% familial representation
(iii) Achieve 15% generic representation by 2015.
Species representation:
What’s new in release 6.0
1. Genome size data now available for 2.1% of angiosperms
Since release 5.0, C-values for over 1,250 angiosperm species not previously listed have been compiled from 78 original references. This brings the total number of angiosperm species with genome size data to 7,542.
2. Genome size data for nearly all gymnosperm genera
Zonneveld recently published data for all cycad genera [1] and 64 of the 67 conifer genera [2]. Thus, genome size data are now available for all 12 gymnosperm families and 94% of genera [3].
REFERENCES
[1] Zonneveld. 2012. Plant Biol. 14: 253-256.
[2] Zonneveld. 2012. Nordic Journal of Botany 30: 490-502.
[3] Christenhusz et al. 2011. Phytotaxa 19: 55-70.
[4] Greilhuber et al. 2006. Plant Biol 8: 770-777.
[5] Pellicer et al. 2010. Bot J Linn Soc 164: 10-15.
[6] Bennett & Leitch. 2011. Ann Bot 107: 467-590.
[7] Paton et al. 2008. Taxon 57: 1371-1371.
[8] APG III. 2009. Bot J Linn Soc 161: 105-121.
[9] Mabberley. 2008. Mabberley's plant-book. Cambridge Univ Press.
[10] Bennett et al. 2003. Ann Bot 91: 547-557.
[11] Tuskan et al. 2006. Science 313: 1596-1604.
[12] Guo et al. 2012. Nature Genetics: Online early Nov. 2012.
[Figure: x-axis, Period (five-year periods from 1950-1954 to 2005-2009, then 2010 & 2011); y-axis, Mean number of estimates (0 to 600).]
Mean number per year of total and ‘first’ DNA C-value estimates communicated in 12 successive five-year periods and the two-year period 2010-2011, between 1950 and 2011. Data taken from the Plant DNA C-values database (release 6.0, Dec. 2012).
Release 6.0 contains data for 8,510 species comprising:
• 7,542 angiosperms from 695 original reference sources
• 355 gymnosperms from 48 original reference sources
• 128 pteridophytes from 21 original reference sources
• 232 bryophytes from 7 original reference sources
• 253 algae from 37 original reference sources
Analysis shows that 2010 and 2011 had the highest rate of genome size estimation on record (c. 450 first estimates for a species per year). With the total number of angiosperm species now estimated to be 352,000 [7], the percentage with genome size estimates currently stands at c. 2.1%.
Generic representation:
Overall, the 7,542 angiosperm species with genome size data in release 6.0 include 1,635 genera, with 187 genera added since release 5.0. Given that 12,962 angiosperm genera are recognised [8], generic representation now stands at 12.6%.
Family representation:
[Figure: x-axis, Year (1950 to 2010); y-axis, Cumulative % of APG families (0 to 60).]
Cumulative % of families recognised by the Angiosperm Phylogeny Group (APG III) [7] with a first C-value in the Plant DNA C-values database (release 6.0, Dec. 2012).
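The species and generic representation figures quoted in the text can be checked against the workshop targets with simple arithmetic. The sketch below is illustrative only: it uses the counts given in the text (7,542 of c. 352,000 species; 1,635 of 12,962 genera), and the function names and the idea of a "shortfall" count are assumptions introduced here, not outputs of the database.

```python
import math

# Counts quoted in the text for release 6.0.
ANGIOSPERM_SPECIES_TOTAL = 352_000
ANGIOSPERM_GENERA_TOTAL = 12_962
SPECIES_WITH_DATA = 7_542
GENERA_WITH_DATA = 1_635


def representation_pct(with_data, total):
    """Percentage of taxa that have at least one C-value estimate."""
    return 100.0 * with_data / total


def shortfall(with_data, total, target_pct):
    """Number of additional taxa needed to reach target_pct (0 if already met)."""
    needed = math.ceil(total * target_pct / 100)
    return max(0, needed - with_data)


species_pct = representation_pct(SPECIES_WITH_DATA, ANGIOSPERM_SPECIES_TOTAL)
genus_pct = representation_pct(GENERA_WITH_DATA, ANGIOSPERM_GENERA_TOTAL)

print(f"Species representation: {species_pct:.1f}% (target 2%)")
print(f"Generic representation: {genus_pct:.1f}% (target 15%)")
print("Genera still needed for 15%:",
      shortfall(GENERA_WITH_DATA, ANGIOSPERM_GENERA_TOTAL, 15))
```

On these counts the 2% species target is already met, while the generic target is not; the shortfall depends entirely on which total number of genera is accepted.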
[Figure: genome size estimates for seven plant species. Legend: estimate obtained using traditional methods; estimate obtained using ‘complete’ genome sequencing.]
Differences in genome size estimated for seven plant species using either ‘complete’ genome sequencing methods or the standard genome size estimation techniques of flow cytometry or Feulgen microdensitometry.
Species key: 1 = Arabidopsis thaliana; 2 = Brachypodium distachyon; 3 = Oryza sativa ssp. indica; 4 = Oryza sativa ssp. japonica; 5 = Populus trichocarpa; 6 = Sorghum bicolor; 7 = Zea mays
If you have comments and/or suggestions contact
[email protected]
What’s in the database
It soon became clear, however, that in plants what was considered to be a ‘complete’ genome sequence actually represented a combination of sequence information for the entire euchromatic portion of the genome together with an estimate of the amount of DNA in the heterochromatic part. This compromise arose largely because of the problems associated with sequencing highly repetitive regions of the genome. Thus, the first release of the Arabidopsis genome in 2000 actually included only 115.4 Mb of sequence data together with a rough estimate of 10 Mb for the unsequenced regions, giving a total genome size of 125 Mb. This contrasted with a study by Bennett et al. [10], who estimated the genome size of Arabidopsis to be 157 Mb using flow cytometry with C. elegans as the calibration standard, c. 25% larger than the estimate from ‘complete’ genome sequencing.
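The size of the Arabidopsis discrepancy can be reproduced from the three numbers in the paragraph above. This is a rough sketch with illustrative names; it takes the rounded 125 Mb total as the baseline, which is an assumption about how the "c. 25%" figure was derived.

```python
# Values in Mb, as quoted in the text for Arabidopsis thaliana.
SEQUENCED_MB = 115.4                  # sequence actually obtained in 2000
ESTIMATED_UNSEQUENCED_MB = 10.0       # rough estimate for unsequenced regions
REPORTED_SEQUENCING_TOTAL_MB = 125.0  # rounded total quoted for the 2000 release
FLOW_CYTOMETRY_MB = 157.0             # Bennett et al. estimate


def percent_larger(value, baseline):
    """How much larger `value` is than `baseline`, as a percentage of baseline."""
    return 100.0 * (value - baseline) / baseline


excess_pct = percent_larger(FLOW_CYTOMETRY_MB, REPORTED_SEQUENCING_TOTAL_MB)
sequenced_share_pct = 100.0 * SEQUENCED_MB / FLOW_CYTOMETRY_MB

print(f"Flow cytometry estimate is {excess_pct:.1f}% larger than the sequencing total")
print(f"Sequenced DNA covers {sequenced_share_pct:.1f}% of the flow cytometry estimate")
```

The sequenced amount is best read as a lower bound: the true C-value cannot be smaller than the DNA actually sequenced, but it can be, and here appears to be, considerably larger.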
New record minimum and maximum values have greatly increased the known range of angiosperm DNA C-values to nearly 2,400-fold.
Extending the range of genome sizes
Genome size estimates are available for 7,542 angiosperms in the Plant DNA C-values database (release 6.0, December 2012).
MD Bennett, E Johnston, MF Fay, J Pellicer, IJ Leitch
Jodrell Laboratory, Royal Botanic Gardens, Kew, Richmond, Surrey TW9 3AB, UK.
Progress towards achieving 75% familial representation has been slow, with first C-values for only 9 families added in release 6.0. With 415 families recognised in APG III [9], familial representation is currently 60%.
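[Axes of the figure comparing sequencing-based and traditional estimates: x-axis, Plant species (1 to 7); y-axis, 1C genome size (Mbp).]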
Neither is the situation for Arabidopsis unique. As the number of ‘complete’ plant genome sequences has risen, it has become clear that many include estimates for difficult-to-sequence repetitive regions of the genome, and that the ‘complete’ genome size quoted in the publication is either based on adding the assembled and unassembled sequence data together (e.g. Populus trichocarpa [11]) or simply taken from the literature. For example, even for one of the most recently published ‘complete’ genome sequences, that of watermelon (Citrullus lanatus), Guo et al. [12] could not align 16.8% of the DNA that had been sequenced. Instead, the genome size used to estimate the percentage of the genome covered by the assembly was taken from the literature, using an estimate determined by flow cytometry. An analysis of the unassembled sequence data from watermelon showed it to consist predominantly of transposable elements. Such a situation is not uncommon and illustrates that accurate incorporation of repetitive DNA sequences into whole genome assemblies remains one of the major limitations for assembling truly ‘complete’ plant genome sequences. Clearly, while the amount of DNA sequence generated from such large-scale sequencing projects is impressive and the information gained is immense, whole genome sequencing using current approaches continues to be unsuitable for providing exact genome size (C-value) measures.
Given the above, two important points are noted:
(1) Complete genome sequencing has so far failed to produce a C-value for any angiosperm species that is more accurate than the prime value for that species previously estimated using non-molecular methods. Genome size estimates from complete genome sequencing should thus not be used in preference to the prime value already listed in the Plant DNA C-values database.
(2) Nevertheless, ‘complete genome sequencing’ has produced a highly accurate measurement of a minimum amount of DNA below which the real C-value for the sequenced taxon cannot fall.