sequencing technologies improved steadily, but such technologies still
tended to concentrate on mining individual genes for information. During this
period, biologists were mostly sequencing DNA fragments that were a few
thousand nucleotides in length, simply because they were interested in specific
genes that they had started working on years before. Most of the bioinformatics
tools available today were created during that period. They include
- All basic sequence-alignment programs
- Phylogenetic and classification methods
- Various display tools adapted to relatively small sequence objects (such
  as protein sequences no more than a few thousand characters long)
Genomics: Getting all the genes at once
The determination of the first complete genome sequence ended the
gene-by-gene routine and initiated the era of genomics: the genetic mapping,
physical mapping, and sequencing of entire genomes. As a consequence, the
DNA sequences we have to work with now are much longer: on the order of a
million bp in length for microbes and up to several billion bp in length for
animals and humans. This revolution called for the design of new bioinformatic
tools and databases capable of storing, querying, analyzing, and displaying
these huge objects in a user-friendly manner. Chapters 3, 5, and 7 present some
of the questions that biologists address at the genome scale, and show the
relevant bioinformatic tools in action.
In contrast to the early days of the gene-by-gene approach, DNA sequences
are now often obtained (along with the presumed protein sequences derived
from those DNA sequences) without any prior knowledge of what is actually
there. In essence, genes are both sequenced and discovered at the same time.
This development prompted the emergence of an entirely new branch of
bioinformatics devoted to the parsing of large DNA sequences into their
components (genes, transcription units, protein-coding regions, regulatory
elements, and so forth). This first pass is then followed by a longer phase of
genome annotation, where the biological functions of these various elements
are (more or less tentatively) predicted. Part IV of this book presents you
with some of the most advanced of these techniques.
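To make the idea of parsing a DNA sequence into candidate protein-coding regions more concrete, here is a minimal sketch in Python. It is not one of the tools presented in this book: it simply scans the six reading frames of a sequence for open reading frames (a start codon followed, in the same frame, by a stop codon). The function names, the min_codons threshold, and the toy sequence are all illustrative assumptions, and the sketch assumes the standard genetic code with ATG as the only start codon. Real gene finders rely on much richer statistical models.

```python
# Naive open-reading-frame (ORF) scan: an illustrative sketch, not a real gene finder.
# Assumptions: standard genetic code, ATG as the only start codon, input limited to A/C/G/T.
START = "ATG"
STOPS = {"TAA", "TAG", "TGA"}
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=30):
    """Yield (strand, frame, start, end) for each ORF found.

    An ORF runs from the first ATG seen in a frame to the next in-frame
    stop codon. Its length in codons counts both the start and the stop
    codon and must be at least min_codons (a hypothetical cutoff).
    Coordinates are 0-based and end-exclusive, relative to the strand
    being scanned (for "-", they index into the reverse complement).
    """
    seq = seq.upper()
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            orf_start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if orf_start is None:
                    if codon == START:
                        orf_start = i          # open a candidate ORF
                elif codon in STOPS:
                    if (i + 3 - orf_start) // 3 >= min_codons:
                        yield strand, frame, orf_start, i + 3
                    orf_start = None           # close it and keep scanning


# Toy sequence (made up for illustration): one short ORF on the forward strand.
toy = "CC" + "ATG" + "AAA" * 3 + "TAA" + "GG"
print(list(find_orfs(toy, min_codons=5)))
```

On a genome of a million bp or more, a scan like this reports many spurious candidates, which is one reason the later annotation phase takes so much longer than this first pass.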
Figure 1-10, representing the whole genome of the bacterium Rickettsia conorii,
illustrates this new level of complexity. This circular DNA molecule is
1.3 million bp long, on the small side for a bacterium. Each little rectangle in
the two outermost circles of features (one circle per strand) corresponds
to a protein-coding gene in the circular genome and spans approximately
1000 bp. Nobody knew which genes (or which proteins) were in that bacterium
before the sequencing started. Almost everything we know now about this
bacterium (and about many others that are fairly inaccessible, such as those
thriving on the ocean floor near volcanic vents at 100°C) has been derived
from bioinformatic analyses.
Chapter 1: Finding Out What Bioinformatics Can Do for You










