Reference Guide

Performance evaluation and analysis

15 Reference Architecture of Dell EMC Ready Solution for HPC Life Sciences | Document 309

3 Performance evaluation and analysis

3.1 Variant calling analysis performance

A typical variant calling pipeline consists of three major steps:

1) aligning sequence reads to a reference genome sequence;

2) identifying regions containing SNPs/InDels; and

3) performing preliminary downstream analysis.

In the tested pipeline, BWA 0.7.2-r1039 is used for the alignment step, and the Genome Analysis Toolkit (GATK) is used for the variant calling step. These are considered standard tools for alignment and variant calling in whole genome or exome sequencing data analysis. GATK version 3.6 was used for the tests, and the workflow tested was obtained from the workshop ‘GATK Best Practices and Beyond’. This workshop introduced a new workflow with three phases:

- Best Practices Phase 1: Pre-processing

- Best Practices Phase 2A: Calling germline variants

- Best Practices Phase 2B: Calling somatic variants

- Best Practices Phase 3: Preliminary analysis

Here we tested Phase 1, Phase 2A, and Phase 3 for a germline variant calling pipeline. The commands used in the benchmark are detailed in APPENDIX A. GRCh37 (Genome Reference Consortium Human build 37) was used as the reference genome sequence, and 50x whole human genome sequencing data from the Illumina Platinum Genomes project, in files named ERR194161_1.fastq.gz and ERR194161_2.fastq.gz, were used for a baseline test (15).
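As a rough illustration of what "50x" depth of coverage means for paired-end data, mean depth can be estimated as the total number of sequenced bases divided by the genome size. The sketch below assumes this simple estimate; the function name and all input numbers are hypothetical round figures chosen for illustration, not the measured properties of the ERR194161 files.

```python
def mean_depth(read_pairs, read_length, genome_size):
    """Estimate mean depth of coverage for paired-end sequencing data.

    Each read pair contributes two reads of `read_length` bases, so the
    total base count is 2 * read_pairs * read_length.
    """
    return 2 * read_pairs * read_length / genome_size


# Hypothetical inputs (assumptions, not values taken from the benchmark data):
# 775 million read pairs, 100 bp reads, and a genome of about 3.1 billion bases.
depth = mean_depth(read_pairs=775_000_000,
                   read_length=100,
                   genome_size=3_100_000_000)
print(f"Estimated mean depth: {depth:.0f}x")
```

This estimate ignores unmapped reads, duplicates, and uneven coverage, so the depth observed after alignment will generally differ from it.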

Ideally, each run would use non-identical sequence data. However, it is extremely difficult to collect non-identical sequence data with more than 50x depth of coverage from the public domain. Hence, we used a single sequence data set for multiple simultaneous runs. A clear drawback of this practice is that the running time of Phase 2, Step 2 might not reflect the true running time, because researchers tend to analyze multiple samples together. This step is also known to be less scalable: its running time increases as the number of samples increases. A subtler pitfall is a storage cache effect. Since all the simultaneous runs read and write the same data at roughly the same time, the measured times may benefit from caching, and the run time would be slightly longer in real cases. Despite these built-in inaccuracies, this variant analysis performance test can provide valuable insights when estimating the level of resources required for an identical or similar analysis pipeline with a defined workload.

Total run time is the elapsed wall-clock time from the earliest start of Phase 1, Step 1 to the latest completion of Phase 3, Step 2. The time for each step is measured from the latest completion time of the previous step to the latest completion time of the current step, as illustrated in Figure 6.

Figure 6 Running time measurement method
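The measurement rule above can be sketched in a few lines of Python. This is a minimal illustration, not the tooling used for the benchmark: the function name, the data layout, and the sample timestamps are all hypothetical. It also assumes that the first step, which has no previous step, is measured from the earliest start across all runs, so that the per-step times add up to the total run time.

```python
def measure_run_times(runs):
    """Compute total and per-step run times for simultaneous pipeline runs.

    `runs` is a list with one entry per simultaneous run. Each entry is a
    list of (start, end) timestamps in hours, one pair per step, in
    pipeline order. All runs must have the same number of steps.
    """
    n_steps = len(runs[0])

    # Earliest start of the first step across all runs.
    earliest_start = min(run[0][0] for run in runs)

    # Latest completion of each step across all runs.
    latest_end = [max(run[i][1] for run in runs) for i in range(n_steps)]

    # Total: earliest start of the first step to latest completion of the last.
    total = latest_end[-1] - earliest_start

    # Per step: latest completion of the previous step to latest completion
    # of the current step. Assumption: the first step starts at earliest_start.
    per_step = []
    previous = earliest_start
    for end in latest_end:
        per_step.append(end - previous)
        previous = end

    return total, per_step


# Hypothetical timestamps (hours) for two simultaneous runs of three steps.
runs = [
    [(0.0, 2.0), (2.0, 5.0), (5.0, 6.0)],
    [(0.5, 2.5), (2.5, 5.5), (5.5, 7.0)],
]
total, per_step = measure_run_times(runs)
print(total)     # 7.0
print(per_step)  # [2.5, 3.0, 1.5]
```

Because each step is bounded by the slowest run, a single straggler lengthens the reported time for that step even when the other runs finish early.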