White Papers
4
INTRODUCTION
De novo assembly?
Genome assembly programs perform complex computations to join billions of fragmented reads of DNA. Short-read sequencing (Next
Generation Sequencing (NGS)) technologies turn the haploid human genome into 2- to 3-billion pieces of jigsaw puzzle with 100 copies
of each piece (1). This puzzle is complicated further by some pieces being missing or containing errors. These characteristics of
sequencing instruments impose the algorithmic challenges in assembling large, complex genomes with large amount of repetitive
regions. In order to compensate for these shortcomings in an easier way, much higher number of read depths
1
than 20 or more is
intentionally generated. Deep and Ultra-deep generally refer over 100 reads per nucleotide to even 1000 reads. Although NGS
technology reduced the cost of sequencing significantly, the read length becomes much shorter than the traditional Sanger sequencing,
and shorter read length makes de novo assembly much harder (2) (3). Since every read has to be compared to every other read, the
imposed time complexity is bounded to O(n
2
) where n represents the number of reads as shown in Figure 1 (3).
Besides nondeterministic polynomial time complexity
2
, the memory requirement also grows exponentially with the number of reads. As
sequence read data is represented as ‘char’ in a typical programming language, 3 billion reads with length of 100 base pairs (bp) can
be translated to 300 TB of memory. With clever ways of represent these data in memory as well as avoiding redundancies, memory
requirement can be minimized, for example, by using a Bruijn graph (4). However, due to the massive volume of data, de novo
assembly still requires extremely large amounts of memory, nearly 2 TB of memory for 3 billion reads. Hence, it is logical to favor a
shared memory architecture to assemble a genome.
1
Coverage (or depth) in DNA sequencing is the number of reads that include a given nucleotide in the reconstructed sequence. Deep sequencing refers
to the general concept of aiming for high number of replicate reads of each region of a sequence.
2
Polynomial time is said to be an algorithm can be solved in polynomial time if the number of steps required to complete the algorithm for a given input
is for some nonnegative integer, where is the complexity of the input. Nondeterministic polynomial time is a subset of polynomial time problems with
some form of nondeterminism or “trial and error”.
Figure 1 De novo assembly process: Baker, 2012