White Papers

Dell - Internal Use - Confidential

Hence, one might ask which alignment software is best. The question is hard to answer because the answer depends on many different conditions; even if we could compare every alignment tool, no single one would emerge as the best in all cases. The choice depends on your goals and the specific use case. For example, we chose to test Burrows-Wheeler Aligner (BWA) because it is part of the popular variant calling workflow with Genome Analysis Toolkit (GATK) (9) (10). BWA scales stably across different numbers of cores and various NGS data sizes on the Dell EMC PowerEdge C6420. A single PowerEdge C6420 server was used to generate baseline performance metrics and to ascertain the optimum number of cores for running BWA in these scaling tests.

Figure 10 shows the run times of BWA for sequence data sizes ranging from 2 to 208 million fragments (MF) and for different numbers of threads. Although the results generally show a speed-up as the core count increases, the optimum number of cores for BWA is between 12 and 20.
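
One way to read an optimum off timings such as those in Figure 10 is to compute the speed-up and parallel efficiency relative to the smallest thread count measured. The following is a minimal Python sketch of that calculation. The run times in the usage section are hypothetical placeholders, not the measurements behind Figure 10, and the function names and the 0.5 efficiency cut-off are assumptions made for illustration only.

```python
def scaling_table(run_times):
    """Return (threads, speedup, efficiency) rows from {threads: seconds}.

    Speed-up and efficiency are measured against the smallest thread
    count present in the input, which acts as the baseline.
    """
    base_threads = min(run_times)
    base_time = run_times[base_threads]
    rows = []
    for threads in sorted(run_times):
        speedup = base_time / run_times[threads]
        # Efficiency: achieved speed-up divided by the ideal (linear) speed-up.
        efficiency = speedup / (threads / base_threads)
        rows.append((threads, speedup, efficiency))
    return rows


def largest_efficient_thread_count(run_times, min_efficiency=0.5):
    """Pick the largest thread count whose efficiency meets the cut-off."""
    efficient = [t for t, _, e in scaling_table(run_times) if e >= min_efficiency]
    return max(efficient)


if __name__ == "__main__":
    # Hypothetical timings in seconds, NOT taken from Figure 10.
    example = {1: 1000.0, 4: 260.0, 12: 100.0, 20: 80.0, 40: 75.0}
    for threads, speedup, efficiency in scaling_table(example):
        print(f"{threads:>3} threads  speed-up {speedup:5.2f}  efficiency {efficiency:4.2f}")
    print("largest efficient thread count:", largest_efficient_thread_count(example))
```

The cut-off is a policy choice rather than a property of BWA: a lower value favors shorter run times per sample, while a higher value favors running more samples concurrently on the same server.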

GENOMICS/NGS DATA ANALYSIS PERFORMANCE

A typical variant calling pipeline consists of three major steps: 1) aligning sequence reads to a reference genome sequence; 2) identifying regions containing SNPs/InDels; and 3) performing preliminary downstream analysis. In the tested pipeline, BWA 0.7.2-r1039 is used for the alignment step and Genome Analysis Toolkit (GATK) is used for the variant calling step. These are considered standard tools for alignment and variant calling in whole genome or exome sequencing data analysis. The GATK version used for the tests is 3.6, and the workflow tested was obtained from the workshop ‘GATK Best Practices and Beyond’, which introduces a new workflow with three phases:

• Best Practices Phase 1: Pre-processing
• Best Practices Phase 2A: Calling germline variants
• Best Practices Phase 2B: Calling somatic variants
• Best Practices Phase 3: Preliminary analyses

Here we tested Phase 1, Phase 2A and Phase 3 for a germline variant calling pipeline. The commands used in the benchmark are detailed in APPENDIX A. GRCh37 (Genome Reference Consortium Human build 37) was used as the reference genome sequence, and 30x whole human genome sequencing data from the Illumina Platinum Genomes project, in the files ERR091571_1.fastq.gz and ERR091571_2.fastq.gz, were used for the baseline test (11).
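
To make the alignment step concrete, the sketch below assembles a paired-end `bwa mem` invocation for the two baseline FASTQ files. It is an illustration only and not the command from APPENDIX A: the reference file name `GRCh37.fa`, the thread count of 16, and the helper name `bwa_mem_command` are assumptions. The only BWA option used is `-t`, which sets the number of threads.

```python
import shlex


def bwa_mem_command(reference, fastq_1, fastq_2, threads):
    """Build the argument list for a paired-end `bwa mem` alignment."""
    if threads < 1:
        raise ValueError("threads must be a positive integer")
    # `-t` sets the number of threads used by `bwa mem`.
    return ["bwa", "mem", "-t", str(threads), reference, fastq_1, fastq_2]


if __name__ == "__main__":
    cmd = bwa_mem_command(
        "GRCh37.fa",  # hypothetical path to an indexed reference
        "ERR091571_1.fastq.gz",
        "ERR091571_2.fastq.gz",
        threads=16,  # assumed value, chosen only for illustration
    )
    # Print the command instead of running it; `bwa mem` writes SAM to stdout,
    # so a real run would redirect the output to a file.
    print(shlex.join(cmd))
```

Building the command as an argument list rather than a single string avoids shell quoting problems when file paths contain spaces, and it keeps the thread count an explicit parameter for scaling tests.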

Ideally, each run would use non-identical sequence data. However, it is extremely difficult to collect non-identical sequence data with more than 30x depth of coverage from the public domain, so we used a single sequence data set for multiple simultaneous runs. A clear drawback of this practice is that the running time of Phase 2, Step 2 might not reflect the true running time, as researchers tend to analyze multiple samples together. This step is also known to be less scalable. The running time of this step increases as the

Figure 10 Scaling behavior of BWA