White Papers

ManualsBrandsDell ManualsConverged InfrastructureHigh Performance Computing Solution Resources

INTRODUCTION

De novo assembly?

Genome assembly programs perform complex computations to join billions of fragmented reads of DNA. Short-read sequencing (Next

Generation Sequencing (NGS)) technologies turn the haploid human genome into 2- to 3-billion pieces of jigsaw puzzle with 100 copies

of each piece (1). This puzzle is complicated further by some pieces being missing or containing errors. These characteristics of

sequencing instruments impose the algorithmic challenges in assembling large, complex genomes with large amount of repetitive

regions. In order to compensate for these shortcomings in an easier way, much higher number of read depths

than 20 or more is

intentionally generated. Deep and Ultra-deep generally refer over 100 reads per nucleotide to even 1000 reads. Although NGS

technology reduced the cost of sequencing significantly, the read length becomes much shorter than the traditional Sanger sequencing,

and shorter read length makes de novo assembly much harder (2) (3). Since every read has to be compared to every other read, the

imposed time complexity is bounded to O(n

) where n represents the number of reads as shown in Figure 1 (3).

Besides nondeterministic polynomial time complexity

, the memory requirement also grows exponentially with the number of reads. As

sequence read data is represented as ‘char’ in a typical programming language, 3 billion reads with length of 100 base pairs (bp) can

be translated to 300 TB of memory. With clever ways of represent these data in memory as well as avoiding redundancies, memory

requirement can be minimized, for example, by using a Bruijn graph (4). However, due to the massive volume of data, de novo

assembly still requires extremely large amounts of memory, nearly 2 TB of memory for 3 billion reads. Hence, it is logical to favor a

shared memory architecture to assemble a genome.

Coverage (or depth) in DNA sequencing is the number of reads that include a given nucleotide in the reconstructed sequence. Deep sequencing refers

to the general concept of aiming for high number of replicate reads of each region of a sequence.

Polynomial time is said to be an algorithm can be solved in polynomial time if the number of steps required to complete the algorithm for a given input

is for some nonnegative integer, where is the complexity of the input. Nondeterministic polynomial time is a subset of polynomial time problems with

some form of nondeterminism or “trial and error”.

Figure 1 De novo assembly process: Baker, 2012