xaktly | Virology

Genomes review

The virology pages are organized like this:


This section is meant to be a brief overview of the ways that living things on Earth store and pass genetic or "blueprint" information from one generation to the next. All living things we know of carry inside each cell a plan, in the form of a code embedded in one or more chemical polymers (DNA and/or RNA), for how to construct their progeny. Viruses, which aren't, by definition, alive, also carry such a genetic blueprint. While the genomes of all living things on Earth are DNA-based, some viruses have evolved to store their genomes in RNA instead.

All organisms and viruses also use RNA as "messengers" to carry parts of the full code out to other parts of their cell(s) (or host cells) in order to carry out the encoded function, usually the construction of a specific protein meant for a specific purpose.

We'll begin with a review of those polymers, RNA and DNA, then talk about the code itself, how it is copied to progeny, and the mistakes that can be made along the way that can lead to functional differences between a cell or organism and its progeny.


RNA is ribonucleic acid and DNA is deoxyribonucleic acid. One subunit (called a nucleotide phosphate) of each is shown below. Each is built around a sugar called ribose, the central ring-shaped molecule. The only chemical difference between the DNA and RNA subunits shown is that in a DNA subunit, the ribose is missing an oxygen atom, thus the "deoxy." This small difference actually leads to important differences between the two molecules when the subunits are linked together in long chains. More on that later. Make sure you spot that missing oxygen.

The phosphate (PO4) group is where the linkage between nucleotide phosphates occurs. The DNA or RNA backbone is phosphate-(deoxy)ribose-phosphate- ... and so on.

Many thousands of these can be linked through that phosphate group via phosphodiester bonds.

The "business end" of a nucleotide phosphate is the base. In the figure below the base is adenine (the two-ringed thing), but DNA actually uses four nucleotide bases, which we'll discuss below. In RNA the bases are the same except that the one called thymine is replaced by another, closely related base called uracil. Uracil is not present in DNA chains.

That's it. It's really simpler than it looks at first glance. There's a phosphate + sugar backbone that displays a base. There are four bases, and the sequence of those as displayed on the phosphate-sugar chain is what carries the code.

Pro tip:

You can think of DNA or RNA chains simply as phosphate-sugar backbones, the purpose of which is to present the bases for reading their sequence. That sequence, read in the 5' to 3' direction, is the genetic code.

The polymer strand

Here's a look at the finished product, a very short DNA chain of four nucleotide phosphates.

The bases stick together

Now let's have a look at the five bases. They are adenine, thymine, guanine and cytosine and, as noted above, in RNA thymine is replaced by uracil. We give these single-letter symbols: A, T, G, C and U.

Don't worry too much about the details just yet. For now, notice that there are similarities and differences. For example, some contain one ring-shaped part, and others two.

All of the bases hook to their respective ribose or deoxyribose sugar by losing the hydrogen atom highlighted in green and forming a bond there.

All except adenine contain one or two double-bonded oxygen atoms. Those oxygens, together with some of the nitrogen atoms, provide the kind of weak bonding from base to base that we need to form helical shapes and other kinds of structures that protect the code by burying it inside a larger molecule.

When the bases are linked together in a long row, they are "presented" for bonding to other bases. Here's how it works:

The figures below show (left) how A and T are a perfect fit for one another. They form hydrogen bonds between the electron clouds (those white blobs) of the oxygen or nitrogen atoms on one base and hydrogens on the other.

Hydrogen bonds are much weaker than covalent chemical bonds, which are formed by the sharing of electrons, but they are of just the right strength to form structures between DNA and RNA strands that can be taken apart and put back together as needed.

The hydrogen bonding (H-bonding) between G and C is stronger because three H-bonds can form. Notice the complementary shapes between the bases. They fit together in "lock-and-key" fashion.
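To make the difference in pairing strength concrete, here is a minimal Python sketch that counts the hydrogen bonds holding a short strand to its perfectly matched partner. It assumes the usual count of two H-bonds per A-T pair and three per G-C pair; the function name `count_h_bonds` and the example sequences are made up for illustration.

```python
# Hydrogen bonds per base pair: two for A-T, three for G-C.
# Each base is listed with the number of H-bonds it makes to its partner.
H_BONDS = {"A": 2, "T": 2, "G": 3, "C": 3}

def count_h_bonds(strand):
    """Total H-bonds between `strand` and its fully complementary partner."""
    return sum(H_BONDS[base] for base in strand.upper())

print(count_h_bonds("ATAT"))  # 8
print(count_h_bonds("GCGC"))  # 12
```

Two strands of the same length can be held together by quite different numbers of H-bonds, which is one reason G-C-rich duplexes are harder to pull apart.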

You can learn more about H-bonding and intermolecular forces here.

Complementarity of DNA strands

Long strands of DNA, with rare exceptions, pair with a complementary strand through hydrogen bonds between base pairs (A-T and G-C). Notice that each strand has a direction, denoted by the labels 5' ("Five-prime") and 3'. These are just the customary labels of the carbons of the deoxyribose (or ribose) sugar ring. By convention, DNA sequences are written and read in the 5' to 3' direction. If we're reading the DNA duplex below from left to right, then we call the top strand (5' - 3') the sense strand, and the bottom strand the antisense or complementary strand.
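Here is a small Python sketch of that complementarity, offered as an illustration rather than a standard tool. Given a sense strand written 5' to 3', it pairs each base with its partner and then reverses the result so that the antisense strand is also written 5' to 3'. The function name `antisense` and the example sequence are assumptions.

```python
# Base-pairing rules for DNA.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

def antisense(sense):
    """Return the complementary strand, written 5' to 3'."""
    # Pair each base with its partner; this list runs 3' to 5'
    # because the two strands point in opposite directions.
    paired = [COMPLEMENT[base] for base in sense.upper()]
    # Reverse it so the result is also written 5' to 3'.
    return "".join(reversed(paired))

print(antisense("ATGGCT"))  # AGCCAT
```

Applying the function twice returns the original strand, which is just another way of saying the two strands carry the same information.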

Base pairing like this allows the genetic code, embedded in the sequence of bases, to be hidden until it is needed. By that we mean that a small group of enzymes (proteins with a chemical or catalytic function) can

  1. separate and copy the sense strand into an RNA as a "message" that can carry the code to the appropriate cellular machinery in order to translate that code into a needed protein.

  2. separate and copy each strand just before a cell splits into two "daughter" cells.

These DNA duplexes coil into the double-helix structure with which you are probably familiar. DNA helices are capable of further coiling into relatively thick structures called chromatin. During cell division, chromatin condenses into chromosomes, which are thick enough to be visible under a light microscope. One reason for this packing is to keep the long DNA molecules (46 of them in a human cell, in 23 pairs) separate and tangle-free during cell division.

Transcription – making the message

DNA encodes genes, blueprints for making proteins. In order to get that done, we need to take two steps:

  1. Transcribe or copy the DNA sequence to a message (messenger RNA or mRNA) that will carry it elsewhere for step 2 ...

  2. Translate the message, building a polypeptide chain (protein) following the instructions in the code.

Now for sure, these two steps are an oversimplification of the process, but you can think of it this way: these things need to get done; they're both just a little more complicated than shown. Those are details we can learn later. For now, these steps will do. Let's take a look at step 1:

One of those hidden steps is illustrated here. The helical DNA has to be unwound, exposing the unpaired bases of each strand. That can involve one or more enzymes, depending on the organism, but for our purposes, that's a detail we can ignore. Notice that we have one strand, the sense strand or +strand, that contains the actual information that the genome stores, and its complement, the antisense or −strand.

Once the DNA is unwound, another enzyme, RNA Polymerase II (RNA Pol II), attaches to the beginning of the gene and begins to recognize each base and construct its complement in RNA (using U instead of T, of course). What's important here is that we want a copy of the +strand, so what's actually read by RNA Pol II is the −strand. The mRNA formed from this reading is an RNA copy of the +strand.

Here are the three strands we're considering now just for review. The sequence is the same as in the figures above.


  • The DNA double helix comprises a +strand of DNA and a −strand. The + or sense strand contains the genetic code for the organism.

  • DNA is unwound and the strands are un-paired in preparation for reading.

  • The −strand is read by RNA Polymerase II (RNA Pol II) and an RNA copy (U instead of T) of the +strand is made. This is the mRNA, ready to be translated.
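The summary above can be sketched in a few lines of Python. This is a toy model only: real transcription involves promoters, many helper proteins and RNA processing, none of which appear here. The function name `transcribe` and the six-base example gene are assumptions chosen for illustration.

```python
# RNA base that pairs with each DNA base on the template (minus) strand.
# Note that A on the template pairs with U, not T, in the new RNA.
DNA_TO_RNA = {"A": "U", "T": "A", "G": "C", "C": "G"}

def transcribe(template_3_to_5):
    """Build an mRNA (5' to 3') from the minus strand written 3' to 5'."""
    return "".join(DNA_TO_RNA[base] for base in template_3_to_5.upper())

sense = "ATGGCT"     # plus strand, written 5' to 3'
template = "TACCGA"  # minus strand, written 3' to 5' so it lines up under the plus strand

mrna = transcribe(template)
print(mrna)  # AUGGCU

# The mRNA is an RNA copy of the plus strand: same sequence, with U in place of T.
assert mrna == sense.replace("T", "U")
```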

Translation – reading the message

Now that our message is ready, it's time to translate it to a different language, that of protein sequences. We'll have more to say about the genetic code later, but for now, just know that nearly every three-base sequence, called a codon, represents one of the 20 naturally occurring amino acids, small organic molecules that make up all proteins. The difference between two proteins is entirely determined by the difference in their amino-acid sequences. Proteins can comprise a few tens to a few thousands of amino acids.

Translation occurs at the ribosome, a large complex of protein and RNA (a different RNA than mRNA; yes, RNA has multiple uses). The ribosome reads codons, finds the amino acid each one represents, and links the amino acids in the correct sequence. One codon is read at a time.

In order to change from the genetic code to a protein sequence, we need a link between the three-base codons and amino acids. That link is provided by the transfer RNAs (tRNAs), which are abundant in the cytoplasm of cells. They are illustrated in the schematic above.

Three-letter codons are "read" by the ribosome one at a time, matched up – through base complementarity – with the proper tRNA, which carries its amino acid. The ribosome forms a bond between that amino acid and the last one in the chain, ratchets forward and moves on until it reaches a stop signal.

Here's the second amino acid. Notice that the codon AUG means methionine (Met) and GCU calls for an alanine (Ala).

In humans, about 20 amino acids per second are added to the chain in this way. That's fast! Now we need to consider the genetic code itself.


  • An mRNA finds (by random diffusion) a ribosome outside of the cell nucleus (bacteria have no nuclei). The mRNA is loaded 5' end (beginning) first.

  • The ribosome reads a three-letter codon and matches it up with its complementary tRNA.

  • Positioning of the tRNA also positions its attached amino acid in the right position for ...

  • the ribosome to catalyze its chemical attachment to the growing chain.

  • This process continues until one of three STOP codons (UAA, UAG, UGA) is reached.
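As a rough illustration of the steps just listed, the following Python sketch plays the role of the ribosome: it reads an mRNA one codon at a time, looks up the matching amino acid and stops at a STOP codon. The lookup table is deliberately partial (just enough codons for the example) and stands in for the tRNAs; the function name `translate` and the example mRNA are assumptions.

```python
# A partial codon table: only the codons needed for this example.
# The full genetic code has 64 entries.
CODON_TABLE = {
    "AUG": "Met",
    "GCU": "Ala", "GCC": "Ala", "GCA": "Ala", "GCG": "Ala",
    "UUU": "Phe", "GGC": "Gly", "AAA": "Lys",
    "UAA": "STOP", "UAG": "STOP", "UGA": "STOP",
}

def translate(mrna):
    """Read an mRNA 5' to 3', one codon at a time, until a STOP codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        codon = mrna[i:i + 3]
        residue = CODON_TABLE[codon]  # the tRNA's job: codon -> amino acid
        if residue == "STOP":
            break                     # the ribosome releases the finished chain
        protein.append(residue)       # add the amino acid to the growing chain
    return protein

print(translate("AUGGCUUUUAAAUGA"))  # ['Met', 'Ala', 'Phe', 'Lys']
```

A codon missing from this small table raises a `KeyError`, a reminder that the sketch covers only a handful of the 64 codons.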

The genetic code

You might have already noticed an apparent mathematical mismatch. We said there are 20 amino acids, but codons are three-base sequences built from four possible bases. The number of possible codons, then, is

$$n = 4 \cdot 4 \cdot 4 = 64 \: \text{possible codons}$$

That's 64 codons for only 20 amino acids. It turns out that most amino acids are represented by more than one codon, and that redundancy is a big part of evolution by natural selection and diversity through random genetic change.
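If you'd like to check the count of 64 yourself, here is a short Python sketch that lists every possible codon using the standard-library function `itertools.product`. The ordering of the bases in `BASES` is just a choice made for this example.

```python
from itertools import product

BASES = "UCAG"  # the four RNA bases

# Every ordered choice of three bases, with repeats allowed.
codons = ["".join(triplet) for triplet in product(BASES, repeat=3)]

print(len(codons))  # 64
print(codons[:4])   # ['UUU', 'UUC', 'UUA', 'UUG']
```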

Here is a table of the genetic code of all organisms on Earth:

You can download a copy of the genetic code for your notes by clicking here.

Notice, for example, that there are four possible codons for the amino acid alanine (Ala): GCU, GCC, GCA and GCG. In fact, any codon beginning with GC encodes an alanine in the protein sequence.

Why the redundancy? Well, one consequence of it is that the genome is somewhat resistant to changes in the protein sequence (which might affect its function) just because of a random mutation in the DNA sequence. Errors do occur all the time, and many organisms have special enzymes to do error checking and repair, but some slip through. In the case of an alanine codon, one-third of all such random errors (that is, errors in the last position of the codon) will have no effect at all on the protein that is built from the template.
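The one-third figure can be checked with a short Python sketch. It generates every single-base change to the alanine codon GCU and counts how many still begin with GC, and so still encode alanine. The helper names `encodes_alanine` and `single_base_mutants` are made up for this example, and the sketch treats every substitution as equally likely, which is a simplifying assumption.

```python
BASES = "UCAG"

def encodes_alanine(codon):
    """Any codon beginning with GC encodes alanine."""
    return codon.startswith("GC")

def single_base_mutants(codon):
    """All codons that differ from `codon` at exactly one position."""
    mutants = []
    for pos in range(3):
        for base in BASES:
            if base != codon[pos]:
                mutants.append(codon[:pos] + base + codon[pos + 1:])
    return mutants

mutants = single_base_mutants("GCU")
silent = [m for m in mutants if encodes_alanine(m)]

print(len(mutants), len(silent))  # 9 3
print(sorted(silent))             # ['GCA', 'GCC', 'GCG']
```

Three of the nine possible single-base changes are silent, and all three are changes in the last position of the codon.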

Notice further that there are some special codons. We mentioned the three STOP codons above. These signal the ribosome that its work is done in building a protein. The codon for the amino acid methionine (Met) serves as the start codon for proteins. All proteins begin with a Met, though it could be cleaved off later in a set of processes known as post-translational modifications.

All in all, these redundancies in the genetic code give organisms a little cushion from the effects of DNA mutations that can and do occur in every organism on Earth.

One more thing: Notice that at the beginning of this section, we said that the codon table applied to every living organism on Earth, and viruses, too. That's strong evidence for the idea that everything alive on Earth today shares a common ancestor.

Uniqueness of DNA/RNA strands and proteins

It's worth considering here the number of possible combinations of RNA and DNA strands of a certain length. For example, we might ask, how many possible combinations of the four bases can we make with a three-nucleotide chain? (Short strands of DNA or RNA are often called oligonucleotides.) Here's a tree diagram showing all 64 possibilities:

So there are 64 possible DNA or RNA trimers, and that's a very short chain. Let's consider a longer chain, say 20 bases. The number of possible 20-nucleotide polymers is

$$4^{20} \approx 1.1 \times 10^{12} \: \text{combinations}$$

That's about 1.1 trillion possible 20-base oligonucleotides. Now if we consider that most genes contain thousands of base-pairs, then you can see that the number of possible combinations is an astronomical number. The mathematics here have huge consequences and implications in the fields of genetics and forensics, among others.
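Python's integers have no size limit, so the count for 20-base chains is easy to verify directly. The helper name `count_sequences` is an assumption made for this sketch.

```python
def count_sequences(alphabet_size, length):
    """Number of distinct chains of `length` units drawn from `alphabet_size` choices."""
    return alphabet_size ** length

n = count_sequences(4, 20)
print(n)           # 1099511627776
print(f"{n:.1e}")  # 1.1e+12
```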

Protein sequences

Now let's do the same kind of math with protein sequences. The number of naturally occurring amino acids used by organisms on Earth to make proteins is 20, so for a 20-amino-acid protein (a very, very short protein), there are

$$20^{20} \approx 1.05 \times 10^{26} \; \text{combinations}$$

It's an even bigger number, of course. That's roughly 1 followed by 26 zeros. If there's a name for that number, I don't know it. Typical proteins consist of several hundred amino acids.
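The same arithmetic is easy to check in Python, and it also gives a feel for how quickly the numbers grow for a more typical protein. The length of 300 amino acids below is only an illustrative stand-in for "several hundred."

```python
AMINO_ACIDS = 20  # number of amino acids used to build proteins

short_protein = AMINO_ACIDS ** 20
print(short_protein)           # 104857600000000000000000000
print(f"{short_protein:.2e}")  # 1.05e+26

# For a protein of 300 amino acids, just count the digits in the result.
typical_protein = AMINO_ACIDS ** 300
print(len(str(typical_protein)))  # 391
```

The number of possible 300-amino-acid sequences has 391 digits, vastly more than any number of proteins that could ever actually be made.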

Viral genomes

Viruses can have one of seven different types of genomes, only one of which is a double-stranded DNA (dsDNA) genome. Viruses can use single or double strands of DNA or RNA to encode their components, and there are further differences, too. You can study those in the section on viral genomes.

Viruses, unlike living cells, don't carry around the machinery (ribosomes and other enzymes) to make the proteins that form them. Instead they co-opt their host cells to do that for them. In the case of some RNA-based viral genomes, the viruses do have to supply their host cells with enzymes not generally found in cells, such as reverse transcriptases, enzymes that construct DNA strands from RNA templates.



The progeny of an organism are its (one or more) descendants or children.

xaktly.com by Dr. Jeff Cruzan is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License. © 2012, Jeff Cruzan. All text and images on this website not specifically attributed to another source were created by me and I reserve all rights as to their use. Any opinions expressed on this website are entirely mine, and do not necessarily reflect the views of any of my employers. Please feel free to send any questions or comments to jeff.cruzan@verizon.net.