White Papers
WHOLE GENOME SEQUENCE VARIANT ANALYSIS
GATK version 3.5 and BWA version 0.7.12-r1039 were used to benchmark variant calling on the Lustre system, while GATK version 3.6
was used for runs on the F800 and H600. The whole genome workflow was obtained from the GATK Best Practices workshop [4], and
its implementation is detailed here [5] and here [2]. The publicly available human genome data set used for the tests was ERR091571.
ERR091571 is one of Illumina’s Platinum Genomes from the NA12878 individual. It has been used for benchmarking by many
genome analysis developers and is relatively error free. The data set can be downloaded from the Short Read Archive (SRA) at the
European Bioinformatics Institute here [6].
To determine the maximum sample throughput possible, an increasing number of genome samples was run on an increasing number
of compute nodes, with either 2 or 3 samples running simultaneously on each node. Batches of 64-189 samples were run on 32-63
compute nodes that mounted NFS-exported directories from the F800 storage cluster. Figure 3 shows the wall-clock time for
each step in the pipeline and the total run time (left axis), while the right axis shows how many genomes per day can be
processed for a given combination of sample size, samples/node ratio, and total compute nodes. The samples/node ratio and
number of compute nodes used per batch of samples are shown beneath the graph. For the 64, 104, and 126 sample sizes, 32, 52, and
63 compute nodes were used, respectively, with a samples/node ratio of 2. For the 129, 156, 180, and 189 sample sizes, 43, 52, 60, and
63 compute nodes were used, respectively, with a samples/node ratio of 3.
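The batch layout above can be checked with a few lines of arithmetic. The following is a minimal sketch, not part of the original benchmark tooling. The names `BATCHES` and `samples_per_node` are hypothetical, and the (samples, nodes) pairs are copied from the text.

```python
# Batch configurations as stated in the text: (samples, compute nodes).
# BATCHES and samples_per_node are illustrative names, not from the benchmark.
BATCHES = [(64, 32), (104, 52), (126, 63), (129, 43), (156, 52), (180, 60), (189, 63)]


def samples_per_node(samples: int, nodes: int) -> int:
    """Return the samples/node ratio, requiring an even split across nodes."""
    if nodes <= 0:
        raise ValueError("nodes must be positive")
    ratio, remainder = divmod(samples, nodes)
    if remainder:
        raise ValueError(f"{samples} samples do not divide evenly over {nodes} nodes")
    return ratio


ratios = [samples_per_node(s, n) for s, n in BATCHES]
for (s, n), r in zip(BATCHES, ratios):
    print(f"{s:>3} samples on {n:>2} nodes -> {r} samples/node")
```

Every batch divides evenly, which is consistent with the stated ratios of 2 and 3 samples per node.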
Figure 3. 10x WGS BWA-GATK performance results on F800. The second 126-sample plot is from a test using ~30x
genome samples (122 genomes/day).
The benchmark results in Figure 3 show that when running 2 samples/compute node, the total run time is approximately 11.5 hours,
while running 3 samples/node yields a run time of approximately 14 hours. Although the run time is longer at 3 samples/node, the
total genomes/day throughput is higher, reaching 325 genomes/day in the run with 189 samples. Genomes/day is calculated as follows:
(24 hours / total sample run time in hours) x number of samples = number of samples that can be processed in a 24-hour period, i.e.
genomes/day. For the 189-sample run, this equates to (24 hours / 13.94 hours total run time) x 189 = 325.4 genomes/day.
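The genomes/day formula is simple enough to express directly in code. The sketch below is illustrative only. The function name `genomes_per_day` is an assumption, and the inputs are the 189-sample figures quoted in the text.

```python
def genomes_per_day(num_samples: int, total_run_time_hours: float) -> float:
    """Throughput = (24 hours / total batch run time in hours) x number of samples."""
    if total_run_time_hours <= 0:
        raise ValueError("total_run_time_hours must be positive")
    return (24.0 / total_run_time_hours) * num_samples


# Values from the text: 189 samples finished in 13.94 hours of wall-clock time.
throughput_189 = genomes_per_day(189, 13.94)
print(f"{throughput_189:.1f} genomes/day")  # prints "325.4 genomes/day"
```

The calculation assumes batches of the same size and duration are run back to back over a full 24-hour period.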