
the sequence of a protein is much more difficult than sequencing DNA —

but all the proteins that a given organism (whether microbe or human being)

can synthesize are encoded in the DNA sequence of its genome. Thus, the

smart shortcut that molecular biologists have been using is to read protein

sequences directly at the information source: in the DNA sequence! This way,

we can pretend to know the amino-acid sequence of a protein that has never

been isolated in a test tube.

Turning DNA into proteins: The genetic code

When you know a DNA sequence, you can translate it into the corresponding

protein sequence by using the genetic code, the very same way the cell itself

generates a protein sequence. The genetic code is universal (with some

exceptions — otherwise life would be too simple!), and it is nature’s solution

to the problem of how one uniquely relates a sequence built from 4 nucleotides

(A, T, G, C) to a suite of 20 amino acids; we’re using symbols (rather than

actual chemicals) to do the same. Understanding how the cell does this was one of the

most brilliant achievements of the biologists of the 1960s. Yet the final

answer can be contained in a (miraculously small) table — as shown in

Figure 1-9. Have a look, but feel free to indulge in awed silence as you enter

the most sacred monument of modern biology.

Here’s how to use the table shown in Figure 1-9: From a given starting point in

your DNA sequence, start reading the sequence 3 nucleotides (one triplet) at

a time. Then consult the genetic code table to read which amino acid

corresponds to the current triplet (triplets are technically referred to as

codons). For instance, a DNA (or messenger RNA) sequence is decoded as follows:

1. Read the DNA sequence:

ATGGAAGTATTTAAAGCGCCACCTATTGGGATATAAG

2. Decompose it into successive triplets:

ATG GAA GTA TTT AAA GCG CCA CCT ATT GGG ATA TAA G . . .

3. Translate each triplet into the corresponding amino acid:

M E V F K A P P I G I STOP
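
The three steps above can be sketched in a few lines of Python. This is only a minimal illustration, not the method used by any particular program: the names CODON_TABLE, split_codons, and translate are made up for this example, and the table is assumed to be the standard genetic code (the one Figure 1-9 is taken to show), ignoring the exceptions mentioned earlier.

```python
# Standard genetic code, built from the conventional T, C, A, G ordering.
# "*" marks the three stop codons (TAA, TAG, TGA).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = dict(
    zip((a + b + c for a in BASES for b in BASES for c in BASES), AMINO_ACIDS)
)


def split_codons(dna, start=0):
    """Step 2: cut the sequence into successive triplets from `start`.

    Messenger RNA input is accepted by treating U as T. Leftover
    nucleotides at the end (fewer than 3) are dropped.
    """
    dna = dna.upper().replace("U", "T")
    return [dna[i:i + 3] for i in range(start, len(dna) - 2, 3)]


def translate(dna, start=0):
    """Step 3: look up each triplet and stop at the first stop codon."""
    protein = []
    for codon in split_codons(dna, start):
        amino_acid = CODON_TABLE.get(codon, "X")  # "X" for unrecognized triplets
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    sequence = "ATGGAAGTATTTAAAGCGCCACCTATTGGGATATAAG"  # Step 1: the DNA sequence
    print(" ".join(split_codons(sequence)))
    print(translate(sequence))  # MEVFKAPPIGI
```

The start argument stands for the "given starting point" in the text: change it, and the same DNA is cut into a different series of triplets.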

If your DNA sequence is correctly listed in the 5' to 3' orientation, you

generate the protein sequence in the conventional N- to C-terminus orientation as well. This

approach has an advantage: You don’t have to think about these orientation

details ever again.

Thus, if you know where a protein-coding region starts in a DNA sequence,

your computer can pretend to be a cell and generate the corresponding

amino-acid sequence! This simple computer translation exercise is at the

