New Insight into the Diversity of Life's Building Blocks: Evenness, Not Variance

Abstract

The protein-coding alphabet has long mystified researchers: why the particular assortment of 20 “canonical” amino acids (“the 20”) should be so highly conserved remains a puzzle. Early theories that explained why there should be exactly four nucleotides and 20 amino acids, such as Gamow's “diamond code” hypothesis (Gamow, 1954) and Crick's “comma-free code” (Crick et al., 1957), were elegant but mechanistically wrong. The discoveries that additional amino acids (for example, selenocysteine, pyrrolysine, and phosphoserine) are naturally incorporated in some organisms, and that the genetic code can be artificially expanded to include dozens of other amino acids by altering tRNAs and aminoacyl-tRNA synthetases (reviewed in Liu and Schultz, 2010), made an explanation for the 20 as a fact of nature as outdated as medieval explanations for why there should be precisely seven heavenly bodies (the Sun, the Moon, and the five planets). However, the factors underlying what kind of protein-coding alphabet we should generally expect in genetic systems have been little explored.

The last comprehensive treatment of the problem of the protein-coding alphabet was the study of Weber and Miller (1981), which gave a case-by-case account of why each of the 20 might be a useful addition to proteins and why other amino acids found by prebiotic synthesis in Urey-Miller experiments or on the Murchison meteorite might be absent. Since then, although considerable work has been done on the order in which amino acids were added to the genetic code (Trifonov, 2000; Di Giulio and Amato, 2009; Higgs, 2009), relatively little work has focused on understanding the structure of the protein-coding alphabet itself.

Many characteristics of amino acids have been frequently noted as inherently important for protein structure formation; these include charge, hydrophobicity, side chain volume, and propensity to appear in specific secondary structures. Similarly, many factors might be important for metabolic/biological reasons, such as the availability or complexity of synthesis (including prebiotic synthesis). Studies of amino acid properties that avoid the problems inherent in using exchange of amino acids over evolutionary time in modern proteins (Di Giulio, 2001) support hydrophobicity, size, and charge as important features of how the amino acids differ from one another (Atchley et al., 2005; Yampolsky and Stoltzfus, 2005; Stoltzfus and Yampolsky, 2007). These features have also been extensively used in studies of the optimality of the genetic code itself (see Knight et al., 1999, for review).

In the paper discussed here (Philip and Freeland, 2011), the authors go beyond Lu and Freeland's earlier work (Lu and Freeland, 2006, 2008), and beyond other quantitative models of the genetic code (Ardell and Sella, 2002; Sella and Ardell, 2002), by defining a metric for the diversity of a pool of amino acids that takes into account not only the range of a parameter of interest (e.g., hydrophobicity) but also how evenly the pool is spread across that range. This measure, the “coverage” of an amino acid pool, is more biologically and chemically meaningful because it scores most highly the pool of amino acids that maximizes heterogeneity with respect to the relevant characteristic.
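The distinction between range and evenness can be illustrated with a minimal sketch. The function names, the use of gap variance as the evenness score, and the toy property values below are assumptions made for illustration; they are not taken from Philip and Freeland's definitions.

```python
from statistics import pvariance


def value_range(values):
    """Spread between the smallest and largest property value."""
    return max(values) - min(values)


def gap_variance(values):
    """Variance of the gaps between neighbouring sorted values.

    Assumed evenness score for this sketch: 0 means perfectly even
    spacing, larger values mean the set is more clumped.
    """
    ordered = sorted(values)
    gaps = [b - a for a, b in zip(ordered, ordered[1:])]
    return pvariance(gaps)


# Hypothetical property values in arbitrary units, not real measurements.
even_set = [0.0, 1.0, 2.0, 3.0, 4.0]
clumped_set = [0.0, 0.1, 0.2, 0.3, 4.0]

for name, values in (("even", even_set), ("clumped", clumped_set)):
    print(name, value_range(values), gap_variance(values))
```

Both toy sets span the same range, so a range-only measure cannot tell them apart; only the gap-based score separates the evenly spread set from the clumped one.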

Using this approach, Philip and Freeland show that evolving genetic codes tend to sample the pool of amino acids that are actually found in the 20 far more than chance would predict. Specifically, they simulated 1 million genetic codes, drawing 8 of the 50 plausible prebiotic candidates (or from a larger pool of 76 candidates including biosynthetically but not prebiotically plausible amino acids), and found that the 20 had both greater range and greater evenness than randomly chosen sets for size, charge, and hydrophobicity, or any combination of those features. The results strongly suggest that the 20 are a highly nonrandom choice compared with other sets of amino acids that might have arisen. They do not, however, appear to be the best of all possible sets of amino acids on these measures; as with the codon assignments of the genetic code itself, they appear to be optimized rather than optimal (Freeland et al., 2000). In other words, although the vast majority of alternative choices are worse, there are better ones to be found when the space is searched extensively.
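The random-sampling comparison described above can be sketched as a simple Monte Carlo test. Everything here is an assumption for illustration: the pool is filled with placeholder random numbers rather than measured amino acid properties, the "observed" set is constructed to be well spread, evenness is scored as gap variance, and far fewer draws are used than the 1 million reported.

```python
import random
from statistics import pvariance


def value_range(values):
    """Spread between the smallest and largest property value."""
    return max(values) - min(values)


def gap_variance(values):
    """Variance of gaps between neighbouring sorted values (0 = even)."""
    ordered = sorted(values)
    return pvariance([b - a for a, b in zip(ordered, ordered[1:])])


def fraction_matching(observed, pool, n_draws=2000, seed=0):
    """Fraction of random same-size subsets of `pool` that match or beat
    `observed` on both range (>=) and evenness (gap variance <=)."""
    rng = random.Random(seed)
    obs_range = value_range(observed)
    obs_gap = gap_variance(observed)
    hits = 0
    for _ in range(n_draws):
        sample = rng.sample(pool, len(observed))
        if value_range(sample) >= obs_range and gap_variance(sample) <= obs_gap:
            hits += 1
    return hits / n_draws


# Placeholder pool of 50 candidate values (arbitrary units, not real data).
pool_rng = random.Random(42)
pool = [pool_rng.uniform(0.0, 10.0) for _ in range(50)]

# Hypothetical "observed" set of 8: every 7th value of the sorted pool,
# so it spans the full range and is fairly evenly spaced by construction.
observed = sorted(pool)[::7]

p = fraction_matching(observed, pool)
print(f"fraction of random sets at least as good: {p:.4f}")
```

A small fraction means that few random sets do as well as the observed one on both criteria, which is the sense in which a set can be called nonrandom; a fraction above zero would also show that better sets exist, matching the "optimized rather than optimal" reading.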

This paper provides an example of how computational approaches are rapidly adding new dimensions of testability to “origin of life” questions. Such approaches work best as companions to bench experiments, acting as a sketch of where best to focus our time and money to fill in the details. As computational chemistry techniques continue to improve, especially in concert with our synthetic biology capabilities, it seems likely that we will soon be able to build novel organisms that will provide a view of ancestral life. Similarly, ancient environments have already been inferred from ancestral state reconstructions of the proteins they contained (Gaucher et al., 2003). The prospects for making fundamental discoveries about previously inaccessible evolutionary pathways thus seem increasingly bright.

Acknowledgments

We would like to thank Dan Knights, Justin Kuczynski, and Laura Wegener Parfrey for useful discussion of this manuscript.

References

Ardell D.H., Sella. 2002. No accident: genetic codes freeze in error-correcting patterns of the standard genetic code. Philos Trans R Soc Lond B Biol Sci, 357:1625–1642.

Atchley W.R., Zhao, Fernandes A.D., Druke. 2005. Solving the protein sequence metric problem. Proc Natl Acad Sci USA, 102:6395–6400.

Crick, Griffith, Orgel. 1957. Codes without commas. Proc Natl Acad Sci USA, 43:416–421.

Di Giulio. 2001. The origin of the genetic code cannot be studied using measurements based on the PAM matrix because this matrix reflects the code itself, making any such analyses tautologous. J Theor Biol, 208:141–144.

Di Giulio, Amato. 2009. The close relationship between the biosynthetic families of amino acids and the organisation of the genetic code. Gene, 435:9–12.

Freeland S.J., Knight R.D., Landweber L.F. 2000. Measuring adaptation within the genetic code. Trends Biochem Sci, 25:44–45.

Gamow. 1954. Possible relation between deoxyribonucleic acid and protein structures. Nature, 173:318.

Gaucher E.A., Thomson J.M., Burgan M.F., Benner S.A. 2003. Inferring the palaeoenvironment of ancient bacteria on the basis of resurrected proteins. Nature, 425:285–288.

Higgs P.G. 2009. A four-column theory for the origin of the genetic code: tracing the evolutionary pathways that gave rise to an optimized code. Biol Direct, 4:16.

Knight R.D., Freeland S.J., Landweber L.F. 1999. Selection, history and chemistry: the three faces of the genetic code. Trends Biochem Sci, 24:241–247.

Liu C.C., Schultz P.G. 2010. Adding new chemistries to the genetic code. Annu Rev Biochem, 79:413–444.

Lu, Freeland. 2006. On the evolution of the standard amino-acid alphabet. Genome Biol, 7:102.

Lu, Freeland S.J. 2008. A quantitative investigation of the chemical space surrounding amino acid alphabet formation. J Theor Biol, 250:349–361.

Philip G.K., Freeland S.J. 2011. Did evolution select a nonrandom “alphabet” of amino acids? Astrobiology, 11:235–240.

Sella, Ardell D.H. 2002. The impact of message mutation on the fitness of a genetic code. J Mol Evol, 54:638–651.

Stoltzfus, Yampolsky L.Y. 2007. Amino acid exchangeability and the adaptive code hypothesis. J Mol Evol, 65:456–462.

Trifonov E.N. 2000. Consensus temporal order of amino acids and evolution of the triplet code. Gene, 261:139–151.

Weber A.L., Miller S.L. 1981. Reasons for the occurrence of the twenty coded protein amino acids. J Mol Evol, 17:273–284.

Yampolsky L.Y., Stoltzfus. 2005. The exchangeability of amino acids in proteins. Genetics, 170:1459–1472.