Genes & the genetic code

In this section, we begin to uncover the mechanism by which cells read the blueprints for proteins from DNA and translate those into actual proteins.

There's a lot of chemical machinery involved, but the first step is to learn the nearly-universal genetic "code" for living things on Earth.

We know that the sequence of nucleotide bases along a DNA strand encodes the sequence of a protein, but how?

It turns out that nature has an ingenious scheme for encoding that information and making it fairly resistant to the random changes in DNA sequence that are always occurring in the genomes of organisms.

The code: codons

In a DNA sequence, the code for each of the 20 naturally-occurring amino acids consists of a sequence of three nucleotide bases, which we'll just refer to with three "letters," like ATG, or CCC. Recall that A = adenine, C = cytosine, T = thymine (or U = uracil in RNA), and G = guanine.

A three-letter sequence that stands for an amino acid is called a codon.

Now let's think about that for a minute. How many different three-letter sequences can we make from the four DNA bases, A, T, C & G? For the first base, we have a choice of four. For the second, we also have a choice of four, so there are 4² = 16 possible two-letter codes. If we multiply by 4 again, we get the number of three-letter sequences, 4³ = 64.

  • The number of possible codons is 64.
  • The number of amino acids is 20.
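The counting argument above can be checked directly. What follows is a minimal Python sketch (the variable names are ours, not part of the original page) that lists every three-letter sequence of the four DNA bases.

```python
from itertools import product

BASES = "TCAG"  # the four DNA bases (use U in place of T for RNA)

# Every ordered choice of three bases: "TTT", "TTC", ..., "GGG"
codons = ["".join(triplet) for triplet in product(BASES, repeat=3)]

print(len(codons))  # 64, i.e. 4 * 4 * 4
print(codons[:4])   # ['TTT', 'TTC', 'TTA', 'TTG']
```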

So we have some overkill there. It turns out that there's usually more than one codon that stands for any particular amino acid. For example, there are six (TTA, TTG, CTT, CTC, CTA and CTG) that stand for leucine, two (TAT & TAC) for tyrosine, and only one (TGG) that stands for tryptophan. Tryptophan and methionine (ATG) are the only amino acids with just one codon.
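To make the redundancy concrete, here is a small sketch that builds the standard codon table and counts how many codons stand for each amino acid. The compact 64-letter string is one common way of writing the standard code (one-letter amino-acid symbols, bases in T, C, A, G order, "*" for a STOP); the names BASES, AMINO_ACIDS and CODON_TABLE are our own choices.

```python
from collections import Counter
from itertools import product

BASES = "TCAG"
# Standard genetic code, one letter per codon, in TCAG order ("*" = STOP)
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")

CODON_TABLE = {
    "".join(triplet): aa
    for triplet, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# How many codons stand for each amino acid (or for STOP)?
degeneracy = Counter(CODON_TABLE.values())

for aa in "LYW":  # leucine, tyrosine, tryptophan
    print(aa, degeneracy[aa])
# L 6
# Y 2
# W 1
```

The same counter can be queried with any other one-letter amino-acid symbol.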

Look at the table of codons below. The first thing to notice is that it's written using uracil instead of thymine. Recall that uracil replaces thymine in RNA, and RNA is really where the codon recognition comes in, but more on that later.


Why the redundancy?

What purpose does the redundancy in the genetic code have? It's ingenious, really. The redundancy makes a genome less susceptible to random mutations that can occur during reproduction, during normal cell operations or as a result of damage, as from radiation.

Think about it. One in every three random mutations will occur in the third base of a codon, and for most amino acids in the code table above, the change won't make any difference. For example, if a codon were to mutate from UUU to UUC, there would be no change in the amino acid it encodes; it would still be phenylalanine. The redundancy gives the genome of any organism a little bit of "robustness." It makes the genome a little resistant to random mutation.
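The claim about third-base changes can be tested by brute force. The sketch below tries every possible change to the third base of every codon and counts how often the meaning stays the same. It assumes the standard code and, as a simplification of ours, counts a STOP changing to another STOP as "no change."

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code in TCAG order ("*" = STOP)
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {
    "".join(triplet): aa
    for triplet, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

same = total = 0
for codon, aa in CODON_TABLE.items():
    for base in BASES:
        if base == codon[2]:
            continue  # same base, so not a mutation
        total += 1
        if CODON_TABLE[codon[:2] + base] == aa:
            same += 1  # the mutant codon means the same thing

print(CODON_TABLE["TTT"], CODON_TABLE["TTC"])  # F F (phenylalanine both times)
print(f"{same / total:.2f}")                   # 0.67
```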

This redundancy in codons is also often called degeneracy.


There are, however, some codons that can quite easily be changed by mutation. Take tryptophan (Trp), for example. Any single-base mutation of its only codon, UGG, replaces tryptophan with a different amino acid or with a STOP codon (see below).

Stop codons

There are three codons that signal the protein construction machinery that it's done with whatever protein it's constructing from its blueprint. These are called STOP codons; they are UAA, UAG and UGA.

A troublesome mutation for any protein, however, can be the mutation of a codon for an amino acid like tryptophan, tyrosine or cysteine TO a STOP codon; each of these is only a single base change away from a STOP. That mutation could result in a protein being cut short and never being completed. If that protein is an important enzyme or structural protein, the mutation could even be lethal to the organism.

Genetic mutations that cause such a profound change in a protein that the result is death or non-viability of the organism are called lethal mutations.
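Which amino acids sit just one base change away from a STOP codon? The following sketch, assuming the standard code, checks every single-base change of every codon. The helper single_base_mutants and the name at_risk are ours, introduced only for illustration.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code in TCAG order ("*" = STOP)
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {
    "".join(triplet): aa
    for triplet, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def single_base_mutants(codon):
    """Yield every codon that differs from `codon` at exactly one position."""
    for pos in range(3):
        for base in BASES:
            if base != codon[pos]:
                yield codon[:pos] + base + codon[pos + 1:]

# Amino acids with at least one codon that a single change turns into a STOP
at_risk = sorted({
    aa
    for codon, aa in CODON_TABLE.items()
    if aa != "*"
    and any(CODON_TABLE[m] == "*" for m in single_base_mutants(codon))
})
print(at_risk)
```

The output uses one-letter symbols, for example W for tryptophan and Y for tyrosine.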

Example 1

To what amino acid could a codon for glycine mutate if only one base were changed?

Solution: The four possible codons for glycine are GGU, GGC, GGA and GGG.


Clearly, a mutation of the third base of the codon will make no difference; the codon will still encode a glycine amino acid.

Now what happens when the second base of the codon is changed? The possible results are

GUX, GCX and GAX,

where the X can stand for U, C, A or G. Codons that begin with GU, GC and GA can encode valine, alanine, aspartic acid or glutamic acid. Of the four, Asp and Glu are the most likely to produce a meaningful change in a protein. Val and Ala are both small hydrophobic residues like glycine, and most of the time one can serve the function of another in a given protein.

Finally, changes in the first position of the codon can produce

UGU, UGC (cysteine, Cys);

UGG (tryptophan, Trp);

CGU, CGC, CGA, CGG, AGA, AGG (arginine, Arg);

AGU, AGC (serine, Ser);

or UGA, a STOP codon.

Any of these could be a more meaningful, if not lethal, mutation in a protein, but hopefully you can see that there is some degree of built-in resistance to change in the genetic code.
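The reasoning in Example 1 can be reproduced with a short script. This is only a sketch under the assumption of the standard code; the function outcomes is a hypothetical helper of ours. Note that the code uses DNA letters (T instead of U) and one-letter amino-acid symbols, so "G" as a table value means glycine, not guanine.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code in TCAG order ("*" = STOP)
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {
    "".join(triplet): aa
    for triplet, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def outcomes(codons, pos):
    """Amino acids (or "*") reachable by changing base `pos` of any codon."""
    results = set()
    for codon in codons:
        for base in BASES:
            if base != codon[pos]:
                results.add(CODON_TABLE[codon[:pos] + base + codon[pos + 1:]])
    return results

# The four glycine codons: GGT, GGC, GGA, GGG
glycine = [codon for codon, aa in CODON_TABLE.items() if aa == "G"]

for pos in range(3):
    print(pos + 1, sorted(outcomes(glycine, pos)))
# 1 ['*', 'C', 'R', 'S', 'W']
# 2 ['A', 'D', 'E', 'V']
# 3 ['G']
```

Passing a different list of codons, such as ["TGG"] for tryptophan, runs the same analysis for another amino acid.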

The chart below presents the results in a different way. On the left we see that any mutation in the 3rd position produces no change in the protein. The center panel shows that of the three possible mutations of the second position, two lead to changes that should be innocuous and one (G → A) leads to a change that could be trouble (magenta). On the right, changes to the first position are relatively significant. Of 12 possible mutations, four are more likely to have significant consequences to a protein.



In biology, "residue" is a commonly-used synonym for "amino acid." It is used when referring to the amino acids in proteins:

"The 11th and 15th residues are tryptophan and glycine."

The start codon: ATG

The molecular machinery that assembles amino acids into a chain needs to know where to begin the process. The signal to begin is usually the codon AUG (ATG in DNA), which is the only codon that codes for the amino acid methionine (Met).

This doesn't mean that the first amino acid on all proteins is methionine. Often there is some post-processing of proteins after they are synthesized from the raw DNA sequence and before they are put to work, in which parts can be removed, and sometimes that includes the beginning.
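As an illustration of start and STOP signals working together, here is a minimal sketch that finds the first ATG in a DNA string and reads codons until it reaches a STOP. The function name and the short test sequence are made up for this example, and real translation involves much more machinery than this.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code in TCAG order ("*" = STOP)
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {
    "".join(triplet): aa
    for triplet, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate_from_start(dna):
    """Translate from the first ATG up to (not including) the first STOP."""
    start = dna.find("ATG")
    if start == -1:
        return ""  # no start codon found
    protein = []
    for i in range(start, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)

# Hypothetical sequence: CC | ATG GGC TGG TAA | GC
print(translate_from_start("CCATGGGCTGGTAAGC"))  # MGW
```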

Finding genes

Imagine that you have a long strand of DNA from a chromosome of an organism, maybe even a human, and you'd like to see what kind of protein it encodes. What you're looking for is a gene, a long stretch of DNA bases (A, T, G, C) that encodes the amino-acid sequence of a protein. Now most proteins are pretty long, consisting of a hundred or more amino acids. That means at least 300 DNA bases for the 100 codons.

One of the problems with identifying genes on a strand of DNA is which reading frame to use. The choice of reading frame is illustrated in the figure.

A short segment of DNA is shown. We can begin "reading" codons from the first position. If we do this, we find three residues, then a STOP codon. Well, that's not a very long protein, and hitting a STOP so soon indicates that we've probably chosen the wrong reading frame, so we advance by one base and see what we get.
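The effect of shifting the reading frame is easy to see in code. In the sketch below, the DNA string is a made-up example (not the sequence from the figure), chosen so that the first frame hits a STOP after three residues. The function translate_frame is our own helper.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code in TCAG order ("*" = STOP)
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {
    "".join(triplet): aa
    for triplet, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate_frame(dna, frame):
    """Translate codon by codon, starting at offset `frame` (0, 1 or 2)."""
    return "".join(
        CODON_TABLE[dna[i:i + 3]] for i in range(frame, len(dna) - 2, 3)
    )

dna = "GCATTGGACTAAGCTGGAACC"  # hypothetical 21-base segment

for frame in range(3):
    print(frame + 1, translate_frame(dna, frame))
# 1 ALD*AGT
# 2 HWTKLE
# 3 IGLSWN
```

Only the first frame contains a STOP ("*"); the other two read straight through this short segment.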

Both the second and third choices yield good residues as far as our short sequence goes, but let's think ahead a bit. Notice from the genetic code table that three of the 64 possible codons are STOPs. That means that just about one in every 20 codons will be a STOP in a random sequence of bases. Now genes aren't random, and 100 codons isn't a very long protein.
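A quick calculation shows why a long STOP-free stretch is unlikely to arise by chance. The sketch assumes a purely random sequence in which all 64 codons are equally likely, which real DNA only approximates.

```python
p_stop = 3 / 64                    # chance that a random codon is a STOP
mean_gap = 1 / p_stop              # average spacing between STOPs, in codons
p_open_100 = (1 - p_stop) ** 100   # chance of 100 random codons with no STOP

print(round(mean_gap, 1))    # 21.3
print(f"{p_open_100:.4f}")   # 0.0082
```

So under this assumption, fewer than 1 in 100 random stretches of 100 codons would be free of STOPs.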

The presence of frequent STOPs is a strong sign that we've picked the wrong reading frame, and checking for them is basically what's done in searching for genes. We search for open reading frames, or ORFs: long stretches that are not interrupted by a STOP, which indicates that they aren't random.

Open reading frames (ORFs)

If a STOP codon appears, on average, roughly once per 20 codons, then either you have not selected an open reading frame (ORF), or the segment of DNA of interest is unlikely to be a gene or part of a gene.
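A very simple ORF search can be sketched as follows: for each of the three forward reading frames, find the longest run of codons with no STOP. This is a toy version under our own assumptions (made-up sequence, forward strand only, no check for a start codon), not how any particular gene-finding program works.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code in TCAG order ("*" = STOP)
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {
    "".join(triplet): aa
    for triplet, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def longest_open_run(dna, frame):
    """Length, in codons, of the longest STOP-free stretch in one frame."""
    best = current = 0
    for i in range(frame, len(dna) - 2, 3):
        if CODON_TABLE[dna[i:i + 3]] == "*":
            current = 0          # a STOP ends the current open stretch
        else:
            current += 1
            best = max(best, current)
    return best

dna = "GCATTGGACTAAGCTGGAACC"  # hypothetical 21-base segment

runs = [longest_open_run(dna, frame) for frame in range(3)]
print(runs)  # [3, 6, 6]
```

On a real chromosome the same idea would be applied to much longer sequences, where an open run of a hundred or more codons stands out against the random background.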

Organism-specific bias for codons

Some organisms seem to have a 'preference' for which of the multiple codons they use to stand for the amino acids where there is redundancy or degeneracy. Human cells, for example, seem to have a bias toward the codon CUG for leucine (Leu). About 40% of all Leu codons in the human genome are CUGs. CUC is second with about 20%.

These biases are important in the field of molecular biology, where we often use primitive organisms like bacteria and yeast to express human proteins. Changing codons to match the organism but still represent the correct amino acid can improve the yield of the desired protein.
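The idea of swapping codons without changing the protein can be sketched in a few lines. The PREFERRED table here is purely hypothetical (a single entry saying that the host favors CTG for leucine), and the input sequence is made up. A real codon-optimization tool would use measured codon-usage tables for the host organism.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code in TCAG order ("*" = STOP)
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {
    "".join(triplet): aa
    for triplet, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Hypothetical host preference: always use CTG for leucine (L)
PREFERRED = {"L": "CTG"}

def translate(dna):
    """Translate every complete codon, reading from the first base."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))

def recode(dna):
    """Swap each codon for the preferred synonym, if one is listed."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(PREFERRED.get(CODON_TABLE[c], c) for c in codons)

original = "ATGTTACTCGGATAA"   # hypothetical: ATG TTA CTC GGA TAA
optimized = recode(original)

print(optimized)                                   # ATGCTGCTGGGATAA
print(translate(original), translate(optimized))   # MLLG* MLLG*
```

The DNA changes, but the encoded amino-acid sequence does not, which is the whole point of the exercise.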

This work by Dr. Jeff Cruzan is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License. © 2016, Jeff Cruzan. All text and images on this website not specifically attributed to another source were created by me and I reserve all rights as to their use. Any opinions expressed on this website are entirely mine, and do not necessarily reflect the views of any of my employers.