Hence, one might want to know which alignment software is best. The question is hard to answer because the answer depends on many
different conditions. Even if we could compare all of the available alignment software, we could not conclude that any one tool is the
best; the choice depends on your goals and the specific use case. We chose to test Burrows-Wheeler Aligner (BWA) because it is part of
the popular variant calling workflow with the Genome Analysis Toolkit (GATK) (9) (10).
BWA scales stably across different numbers of cores and various NGS data sizes on the Dell EMC PowerEdge C6420. A single
PowerEdge C6420 server was used to generate baseline performance metrics and to ascertain the optimum number of cores for
running BWA in these scaling tests.
Figure 10 shows the run times of BWA for sequence data sizes ranging from 2 to 208 million fragments (MF) and for different
numbers of threads. Although the results generally show speed-up as the core count increases, the optimum number of cores for BWA
is between 12 and 20.
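One way to read an optimum core count off a scaling test like this is to convert the measured run times into speed-up and parallel
efficiency, then pick the smallest thread count whose run time is close to the best observed. The following is a minimal sketch of
that bookkeeping, not the procedure used for Figure 10. The function names, the 10% tolerance and the run times in the demo are
illustrative assumptions; the demo values are placeholders, not measurements from this paper.

```python
def speedup_table(runtimes):
    """Return (threads, speedup, efficiency) rows from {threads: seconds}.

    Speed-up is measured against the smallest thread count present, and
    efficiency is speed-up divided by the relative increase in threads.
    """
    base_threads = min(runtimes)
    base_time = runtimes[base_threads]
    rows = []
    for threads in sorted(runtimes):
        speedup = base_time / runtimes[threads]
        efficiency = speedup * base_threads / threads
        rows.append((threads, speedup, efficiency))
    return rows


def smallest_count_within(runtimes, tolerance=0.10):
    """Smallest thread count whose run time is within `tolerance` of the best.

    The 10% default is an arbitrary illustrative threshold.
    """
    best = min(runtimes.values())
    for threads in sorted(runtimes):
        if runtimes[threads] <= best * (1 + tolerance):
            return threads


if __name__ == "__main__":
    # Placeholder run times in seconds (NOT data from Figure 10).
    example = {1: 100.0, 2: 50.0, 4: 40.0, 8: 38.0}
    for threads, speedup, efficiency in speedup_table(example):
        print(f"{threads:>3} threads  speed-up {speedup:.2f}  efficiency {efficiency:.2f}")
    print("suggested thread count:", smallest_count_within(example))
```

Reporting efficiency alongside speed-up makes diminishing returns visible: once efficiency drops sharply, additional cores are
usually better spent on running another sample concurrently.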
GENOMICS/NGS DATA ANALYSIS PERFORMANCE
A typical variant calling pipeline consists of three major steps: 1) aligning sequence reads to a reference genome sequence; 2)
identifying regions containing SNPs/InDels; and 3) performing preliminary downstream analysis. In the tested pipeline, BWA
0.7.2-r1039 is used for the alignment step and the Genome Analysis Toolkit (GATK) is used for the variant calling step. These are
considered standard tools for alignment and variant calling in whole genome or exome sequencing data analysis. GATK version 3.6
was used for the tests, and the workflow tested was obtained from the workshop ‘GATK Best Practices and Beyond’. This
workshop introduces a new workflow with three phases:
Best Practices Phase 1: Pre-processing
Best Practices Phase 2A: Calling germline variants
Best Practices Phase 2B: Calling somatic variants
Best Practices Phase 3: Preliminary analyses
Here we tested Phase 1, Phase 2A and Phase 3 for a germline variant calling pipeline. The commands used in the benchmark are
detailed in APPENDIX A. GRCh37 (Genome Reference Consortium Human build 37) was used as the reference genome sequence, and 30x
whole human genome sequencing data from the Illumina Platinum Genomes project, named ERR091571_1.fastq.gz and
ERR091571_2.fastq.gz, were used for the baseline test (11).
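For orientation, the alignment step on this paired-end data set could be assembled as in the sketch below. This is an illustration
only; the commands actually benchmarked are those in APPENDIX A. It assumes the `bwa mem` algorithm with its `-t` (threads) and
`-R` (read group header) options. The reference file name `GRCh37.fa`, the helper name and the thread count of 16 are hypothetical
choices made for this example.

```python
import shlex


def build_bwa_mem_command(reference, reads_1, reads_2, threads=16, read_group=None):
    """Build (but do not run) a `bwa mem` command line for paired-end reads.

    `reference` must be a FASTA file already indexed with `bwa index`.
    `read_group`, if given, is passed through `-R` unchanged.
    """
    cmd = ["bwa", "mem", "-t", str(threads)]
    if read_group is not None:
        cmd += ["-R", read_group]
    cmd += [reference, reads_1, reads_2]
    return cmd


if __name__ == "__main__":
    # "GRCh37.fa" is a hypothetical file name for the indexed reference.
    cmd = build_bwa_mem_command(
        "GRCh37.fa",
        "ERR091571_1.fastq.gz",
        "ERR091571_2.fastq.gz",
        threads=16,
    )
    # Print a shell-safe form; bwa writes SAM to standard output when executed.
    print(shlex.join(cmd))
```

Building the command as a list keeps file names with spaces or special characters intact when it is later passed to
`subprocess.run`, and makes it easy to sweep the thread count in a scaling test.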
Ideally, each run would use non-identical sequence data. However, it is extremely difficult to collect non-identical sequence data
with more than 30x depth of coverage from the public domain. Hence, we used a single sequence data set for multiple simultaneous
runs. A clear drawback of this practice is that the running time of Phase 2, Step 2 might not reflect the true running time, as researchers
tend to analyze multiple samples together. Also, this step is known to be less scalable. The running time of this step increases as the
Figure 10 Scaling behavior of BWA