examination of genome size and content across the tree of life
Prokaryotic genome data (RefSeq assembly statistics and feature counts) were downloaded from NCBI Assembly. This included both the "Assembly Statistics Report" and the "Feature Counts Report" for archaea and bacteria. The assembly and annotation data for each genome were then combined using the Python script aggregate_size_and_features.py.
Eukaryotic genome data were taken from the supplement of Elliott 2015, "What's in a genome? The C-value enigma and the evolution of eukaryotic genome content". Daily updated versions are taken from NCBI GENOME REPORTS.
There is a rather strict relationship of genome size to gene count among prokaryotes, of approximately 1000bp per gene.
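As a rough illustration of this rule, the sketch below predicts a gene count from a genome size and estimates the bp-per-gene slope from paired observations. The function names are hypothetical and are not part of aggregate_size_and_features.py; the fit is a simple least-squares line through the origin, which is an assumption about how one might check the rule, not the method used for the figures here.

```python
def expected_gene_count(genome_size_bp, bp_per_gene=1000.0):
    """Predict a gene count from genome size under a fixed bp-per-gene rule."""
    return genome_size_bp / bp_per_gene


def fit_bp_per_gene(genome_sizes_bp, gene_counts):
    """Least-squares slope through the origin for size = slope * genes."""
    numerator = sum(s * g for s, g in zip(genome_sizes_bp, gene_counts))
    denominator = sum(g * g for g in gene_counts)
    return numerator / denominator


# Predicted gene counts for a few example genome sizes (in bp).
for size_bp in (1_000_000, 5_000_000, 13_000_000):
    print(size_bp, expected_gene_count(size_bp))
```

With real data, the two input lists for fit_bp_per_gene would be the genome size and gene count columns of the combined assembly and feature table; a slope near 1000 would match the rule stated above.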
This rule breaks down for eukaryotes, particularly the plants and animals. Larger proteins (hence larger genes) are much more common in eukaryotes, due to protein/domain fusions or repeated domains. The bulk of much larger genes, however, is junk in the form of introns. The largest human protein, titin (a 281kb gene with 365 exons), does not come from the largest gene, which is instead dystrophin (2Mb with 89 exons). The dystrophin protein is almost 10x smaller than the titin protein, but its gene is roughly 7x larger. Below, the spike in human proteins at ~300-350AAs is due to olfactory receptors.
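The titin and dystrophin comparison can be made concrete with a little arithmetic. In the sketch below, the gene lengths are the figures quoted above, while the protein lengths (about 34,350 residues for titin and 3,685 for dystrophin) are approximate values for the canonical isoforms and are an assumption of this example, not taken from this document.

```python
# Gene lengths are the figures quoted in the text above.
# Protein lengths are approximate canonical-isoform values and are an
# assumption of this sketch.
GENES = {
    "titin": {"gene_bp": 281_000, "protein_aa": 34_350},
    "dystrophin": {"gene_bp": 2_000_000, "protein_aa": 3_685},
}


def coding_fraction(gene_bp, protein_aa):
    """Fraction of the gene span needed to encode the protein (3 bp per residue)."""
    return 3 * protein_aa / gene_bp


for name, info in GENES.items():
    fraction = coding_fraction(info["gene_bp"], info["protein_aa"])
    print(f"{name}: {fraction:.1%} of the gene span is coding")
```

Under these assumptions, roughly a third of the titin gene span is coding sequence, while for dystrophin it is under 1%, which is the sense in which the bulk of the larger gene is intronic.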
The largest bacterial genomes in this set are Sorangium cellulosum at 13.0Mb and Ktedonobacter racemifer at 13.6Mb. These are substantially larger than the largest currently assembled archaeal genome, Haloterrigena turkmenica at 5.4Mb. Nonetheless, both domains follow the gene-genome size scaling rule. Because most prokaryotes carry a single copy of the genome and lack recombination mechanisms, genome size would be determined by an equilibrium between gene addition (by HGT or duplication) and streamlining by stochastic gene loss. There are few mechanisms for creating totally new genes (with respect to any given genome): either duplication and subsequent diversification, or possibly frameshifting of an existing protein-coding gene. As the average gene size is also around 1kb for nearly all prokaryotes, this also suggests that few genes result from fusion of other genes.
Because a smaller genome saves the energetic cost of replication, transcription, and protein synthesis, the advantage of a reduced genome is clear. A stable environment therefore is likely to lead to streamlining (Bentkowski 2015). Conversely, this would imply that an unstable environment (rapidly changing across multiple time scales of environmental conditions, energy sources, or interaction partners) leads to genome expansion, by addition and retention of genes and pathways.
Eukaryotes, on the other hand, appear free from this constraint. At the moment, it does not appear that single-celled eukaryotes follow any such rule, and it is clear that multicellular plants and animals show no correlation between genome size and gene number. Diploidy probably plays a role in this: it becomes increasingly difficult to remove genes from a population, so the stochastic loss of a gene would be fixed in the population more often by drift (also random) than by selection on energetics.
A key relationship that follows is that of cell size to genome size (see data from Shuter 1983). Cavalier-Smith 2005 argues that several factors influence genome size in protists:
"The central factor is cell volume. This is generally highly adaptive in both multicellular organisms and protists. A huge range (roughly 300 000-fold) of cell sizes has evolved in eukaryotes for adaptive reasons; but the spectrum is markedly different in breadth and mean in different groups, which is also adaptively explicable. The spectrum results from opposing advantages and disadvantages of small versus large cells. Cell volume for protists is the same as body size and thus fundamentally and centrally important for defining their ecological niche."
A small eukaryotic cell may be eaten by a larger predator, but a mutation in that cell could increase its size (via duplication, cell cycle mutation, etc.). That is, if a larger genome means a larger cell, then the cell may be able to eat more things, and fewer things can swallow it.
This plot was based on data downloaded from the Animal Genome Size Database, and plotted with the Rscript animal_genome_size_overview_v1.R. Birds and mammals appear quite constrained in their genome sizes. Perhaps the same would be seen if the other phylum-level plots were restricted to subgroups, but Cavalier-Smith 2005 asserts that the constraint is related to cell size:
"This is well shown in mammals, where all eutherian orders have essentially the same genome size, except bats which, like birds, have smaller cells to allow more rapid gas exchange by red cells during flight."
A barplot of repeats in the human genome was based on data from repeatmasker.org. This was preprocessed with the python script repeatmasker_to_summary.py and then plotted with the associated Rscript make_repeatmasker_barplot.R.
~/git/genome-size/repeatmasker_to_summary.py hg38.fa.out.gz > hg38.fa.out.summary.tab
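For readers without the script at hand, the following is a minimal sketch of the kind of summary such a preprocessing step might produce: total annotated bases per repeat class from a RepeatMasker .out file. It is not the repository's repeatmasker_to_summary.py, whose internals are not shown here; the function name is hypothetical, and the column positions assume the standard whitespace-delimited .out layout (query begin and end in columns 6 and 7, class/family in column 11).

```python
import gzip
import sys
from collections import defaultdict


def summarize_repeatmasker(lines):
    """Sum annotated bases per repeat class from RepeatMasker .out lines."""
    bases_by_class = defaultdict(int)
    for line in lines:
        fields = line.split()
        # Data rows start with an integer score and have at least 14 columns;
        # this skips the header lines and any blank lines.
        if len(fields) < 14 or not fields[0].isdigit():
            continue
        begin, end = int(fields[5]), int(fields[6])
        # Column 11 holds "class/family"; keep only the class part.
        repeat_class = fields[10].split("/")[0]
        # Overlapping annotations are counted twice in this simple version.
        bases_by_class[repeat_class] += end - begin + 1
    return dict(bases_by_class)


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage (hypothetical): python this_sketch.py hg38.fa.out.gz
    with gzip.open(sys.argv[1], "rt") as handle:
        for repeat_class, bases in sorted(summarize_repeatmasker(handle).items()):
            print(f"{repeat_class}\t{bases}")
```

Dividing each total by the assembly length would give the genome fraction per repeat class, which is the quantity a barplot of repeats would display.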
Consequently, most of the human genome is junk, as hypothesized by Ohno in 1972. Two papers with more modern methods find roughly the same, using comparison across mammals by Lindblad-Toh 2011 or across human populations by Ward 2012.
Some other analyses made use of data from the paper:
Francis, W.R. and G. Worheide (2017) Similar ratios of introns to intergenic sequence across animal genomes. Genome Biology and Evolution 9 (6): 1582-1598.
In this paper, I showed a similar ratio of introns to intergenic sequence in a few dozen animal genomes, suggesting that the processes that shape intronic and intergenic fractions are likely similar. This has been extended to many more species, now including many chromosome-level assemblies. All raw data (for re-analysis or figures) and re-annotations (both GFF and protein) can be found at the associated bitbucket repository.
There are a few unresolved questions with this result. The first is the consideration of the role of time in evolution. As prokaryotes do not have introns, the "original" eukaryote would have been intronless, but nonetheless had intergenic regions (a single-digit percentage of the genome). After the symbiosis event, introns were introduced, probably from genes from the symbiont. This easily could have resulted in a genome with more intronic bases than intergenic, meaning that the ratio of intron:intergenic went from 0:1, up to maybe 10:1, and then had to stabilize at some point.
The second issue is dealing with the number of introns versus the number of intergenic blocks. The total number of bases is about the same, though the number of blocks is roughly tenfold higher for introns (depending on the eukaryotic species). Again, this suggests that it is the number of bases that matters, and not the number of blocks.
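The distinction between bases and blocks can be sketched for a single chromosome as follows. This is an illustrative calculation only, not the pipeline used in the paper: the function name and input layout are hypothetical, coordinates are assumed to be 1-based and inclusive, and genes are assumed not to overlap one another.

```python
def intron_intergenic_stats(chrom_length, genes):
    """Count intronic and intergenic bases and blocks on one chromosome.

    genes is a list of genes, each given as a list of (start, end) exon
    coordinates (1-based, inclusive). Genes are assumed not to overlap.
    """
    genes = sorted((sorted(exons) for exons in genes), key=lambda exons: exons[0][0])
    stats = {"intron_bases": 0, "intron_blocks": 0,
             "intergenic_bases": 0, "intergenic_blocks": 0}
    previous_gene_end = 0
    for exons in genes:
        gene_start, gene_end = exons[0][0], exons[-1][1]
        # Intergenic block between the previous gene (or chromosome start) and this gene.
        gap = gene_start - previous_gene_end - 1
        if gap > 0:
            stats["intergenic_bases"] += gap
            stats["intergenic_blocks"] += 1
        # Introns are the gaps between consecutive exons of the same gene.
        for (_, left_end), (right_start, _) in zip(exons, exons[1:]):
            intron_length = right_start - left_end - 1
            if intron_length > 0:
                stats["intron_bases"] += intron_length
                stats["intron_blocks"] += 1
        previous_gene_end = gene_end
    # Intergenic block from the last gene to the chromosome end.
    tail = chrom_length - previous_gene_end
    if tail > 0:
        stats["intergenic_bases"] += tail
        stats["intergenic_blocks"] += 1
    return stats
```

Summing these counts over all chromosomes gives both the intron:intergenic ratio in bases and the ratio in blocks, so the two can be compared directly for a given annotation.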