the sequence of a protein is much more difficult than sequencing DNA —
but all the proteins that a given organism (whether microbe or human being)
can synthesize are encoded in the DNA sequence of its genome. Thus, the
smart shortcut that molecular biologists have been using is to read protein
sequences directly at the information source: in the DNA sequence! This way,
we can pretend to know the amino-acid sequence of a protein that has never
been isolated in a test tube.
Turning DNA into proteins: The genetic code
When you know a DNA sequence, you can translate it into the corresponding
protein sequence by using the genetic code, the very same way the cell itself
generates a protein sequence. The genetic code is nearly universal (there are
a few exceptions; otherwise life would be too simple!), and it is nature’s
solution to the problem of how one uniquely relates a sequence written in a
4-letter nucleotide alphabet (A, T, G, C) to a set of 20 amino acids; we’re
using symbols (rather than actual chemicals) to do the same. Understanding
how the cell does this was one of the
most brilliant achievements of the biologists of the 1960s. Yet the final
answer can be contained in a (miraculously small) table — as shown in
Figure 1-9. Have a look, but feel free to indulge in awed silence as you enter
the most sacred monument of modern biology.
Here’s how to use the table shown in Figure 1-9: From a given starting point in
your DNA sequence, start reading the sequence 3 nucleotides (one triplet) at
a time. Then consult the genetic code table to read which amino acid
corresponds to the current triplet (technically referred to as a codon). For
instance, a DNA sequence (or the equivalent messenger RNA, with U in place
of T) is decoded as follows:
1. Read the DNA sequence:
ATGGAAGTATTTAAAGCGCCACCTATTGGGATATAAG
2. Decompose it into successive triplets:
ATG GAA GTA TTT AAA GCG CCA CCT ATT GGG ATA TAA G . . .
3. Translate each triplet into the corresponding amino acid:
M E V F K A P P I G I STOP
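The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not a definitive implementation: it assumes the standard genetic code, and the names CODON_TABLE and translate, as well as the start parameter, are made up for this example. It stops at the first stop codon and ignores any incomplete triplet left over at the end.

```python
from itertools import product

# Standard genetic code, written with the bases in the order T, C, A, G.
# The 64 letters line up with the codons TTT, TTC, TTA, TTG, TCT, ...;
# "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna, start=0):
    """Translate a 5'-to-3' DNA (or mRNA) string, beginning at index `start`."""
    dna = dna.upper().replace("U", "T")  # accept messenger RNA as well
    protein = []
    # Step 2: walk through the sequence one complete triplet at a time.
    for i in range(start, len(dna) - 2, 3):
        # Step 3: look up the amino acid for the current codon.
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":  # stop codon: the protein ends here
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate("ATGGAAGTATTTAAAGCGCCACCTATTGGGATATAAG"))  # MEVFKAPPIGI
```

Building the table from a 64-letter string keeps the example short; a hand-written dictionary of all 64 codons would work equally well.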
If your DNA sequence is correctly listed in the 5' to 3' orientation, you
generate the protein sequence in the conventional N- to C-terminus
orientation as well. This approach has an advantage: You don’t have to think
about these orientation details ever again.
Thus, if you know where a protein-coding region starts in a DNA sequence,
your computer can pretend to be a cell and generate the corresponding
amino-acid sequence! This simple computer translation exercise is at the










