White Papers
o Best Practices Phase 1: Pre-processing
o Best Practices Phase 2A: Calling germline variants
o Best Practices Phase 2B: Calling somatic variants
o Best Practices Phase 3: Preliminary analyses
Here we tested Phase 1, Phase 2A, and Phase 3 of a germline variant calling pipeline. The details of the commands used in the benchmark
are given in APPENDIX A. GRCh37 (Genome Reference Consortium Human build 37) was used as the reference genome sequence, and 10x
whole human genome sequencing data from the Illumina Platinum Genomes project, named ERR091571_1.fastq.gz and
ERR091571_2.fastq.gz, were used for a baseline test [13]. Further tests with 50x coverage whole genomes were run to check the
scalability of the solution as the data size increased. In addition to the human genome, genomes from other mammals and plants (cow, pig,
rice, and corn) were also tested. The details of the datasets are given in APPENDIX B.
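For orientation, the sketch below outlines how the Phase 1 (pre-processing) and Phase 2A (germline calling) steps can be chained with the tools used in this benchmark (BWA, sambamba, samtools, and GATK 3.5). It is a minimal illustration, not the commands from APPENDIX A; the reference, dbSNP, read-group, and output file names are assumed placeholders, and the 13-thread setting mirrors the per-subprocess core cap described later in this section.

```python
#!/usr/bin/env python
"""Minimal sketch of the Phase 1 / Phase 2A steps (not the APPENDIX A commands).
Reference, dbSNP, read-group, and file names are assumed placeholders."""
import subprocess

REF = "GRCh37.fa"            # assumed reference FASTA (GRCh37)
DBSNP = "dbsnp_138.b37.vcf"  # assumed known-sites file for BQSR
FQ1, FQ2 = "ERR091571_1.fastq.gz", "ERR091571_2.fastq.gz"
THREADS = 13                 # per-subprocess core cap used in this benchmark

steps = [
    # Phase 1: map reads, convert to BAM, sort, and mark duplicates
    f"bwa mem -t {THREADS} -R '@RG\\tID:ERR091571\\tSM:sample1\\tPL:ILLUMINA' "
    f"{REF} {FQ1} {FQ2} | samtools view -bS - > aligned.bam",
    f"sambamba sort -t {THREADS} -o sorted.bam aligned.bam",
    f"sambamba markdup -t {THREADS} sorted.bam dedup.bam",
    # Phase 1: base quality score recalibration (BQSR)
    f"java -jar GenomeAnalysisTK.jar -T BaseRecalibrator -nct {THREADS} "
    f"-R {REF} -I dedup.bam -knownSites {DBSNP} -o recal.table",
    f"java -jar GenomeAnalysisTK.jar -T PrintReads -nct {THREADS} "
    f"-R {REF} -I dedup.bam -BQSR recal.table -o recal.bam",
    # Phase 2A: germline variant calling with HaplotypeCaller in GVCF mode
    f"java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -nct {THREADS} "
    f"-R {REF} -I recal.bam --dbsnp {DBSNP} --emitRefConfidence GVCF "
    f"-o sample.g.vcf",
]

for cmd in steps:
    subprocess.run(cmd, shell=True, check=True)  # stop on the first failure
```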
Ideally, non-identical sequence data would be used for each run. However, it is extremely difficult to collect non-identical sequence data
with more than 30x depth of coverage from the public domain, so a single sequence data set was used for multiple simultaneous runs. A
clear drawback of this practice is that the running time of Phase 2, Step 2 might not reflect the true running time, as researchers tend to
analyze multiple samples together and this step is known to be less scalable: its running time increases as the number of samples
increases. A subtler pitfall is the storage cache effect; since all the simultaneous runs read and write at roughly the same moment, the
measured run times will be shorter than in real cases. Despite these built-in inaccuracies, this variant analysis performance test can
provide valuable insight into how many resources are required for an identical, or even similar, analysis pipeline with a defined
workload.
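To make the reuse of a single data set concrete, the sketch below stages a separate working directory for each simultaneous run, with all runs pointing back to the same pair of FASTQ files. The directory layout, run count, and use of symlinks are assumptions for illustration, not the benchmark's actual staging scripts.

```python
"""Sketch: one data set shared by many simultaneous runs (assumed layout)."""
import os

FASTQS = ["ERR091571_1.fastq.gz", "ERR091571_2.fastq.gz"]
N_RUNS = 80  # matches the 80 simultaneous pipelines used in this test

for i in range(N_RUNS):
    workdir = f"run_{i:02d}"
    os.makedirs(workdir, exist_ok=True)
    for fq in FASTQS:
        link = os.path.join(workdir, fq)
        if not os.path.lexists(link):
            os.symlink(os.path.abspath(fq), link)  # every run reads the same files

# Because all runs read the same input at roughly the same moment, the shared
# storage cache makes the measured run times optimistic relative to real workloads.
```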
Variant Analysis test configuration
Two different configurations, Intel® OPA and IB FDR, were used to benchmark the BWA-GATK variant analysis pipeline. Table 5
summarizes the configurations of the PowerEdge C6320 with Intel® OPA and the FC430 with IB FDR solutions. The processors in the
two solutions are comparable, except that the dual Intel® Xeon® E5-2697 v4 has more cores, a higher max turbo frequency, and greater
max memory bandwidth. The same versions of all applications were installed on both platforms before testing.
Table 5: Test configurations for Genomics/NGS data analysis
Component | C6320/OPA | FC430/FDR
Server | 40x PowerEdge C6320 | 40x PowerEdge FC430 in FX2 chassis
Processor | Dual Intel® Xeon® E5-2697 v4, 18 cores (1440 cores total); base frequency 2.30 GHz; max turbo frequency 3.60 GHz; max memory bandwidth 76.8 GB/s | Dual Intel® Xeon® E5-2695 v3, 14 cores (1120 cores total); base frequency 2.30 GHz; max turbo frequency 3.30 GHz; max memory bandwidth 68 GB/s
Memory | 128 GB (8x 16 GB DIMM), 2400 MT/s, dual rank, x4 data width | 128 GB (8x 16 GB RDIMM), 2133 MT/s, dual rank, x4 data width
Storage | 480 TB IEEL (Lustre) | 480 TB IEEL (Lustre)
Interconnect | Intel® OPA | IB FDR
OS | Red Hat Enterprise Linux 7.2 | Red Hat Enterprise Linux 6.6
Cluster management tool | Bright Cluster Manager 7.2 | Bright Cluster Manager 7.1
Snoop mode | Opportunistic snoop broadcast | Early snoop
Short sequence aligner | BWA 0.7.2-r1039 | BWA 0.7.2-r1039
Variant analysis | GATK 3.5 | GATK 3.5
Utilities | sambamba 0.6.0, samtools 1.2.1 | sambamba 0.6.0, samtools 1.2.1
Each node in the C6320/OPA solution has 36 cores, whereas each node in the FC430/FDR solution provides 28 cores. For a fair
comparison, two BWA-GATK pipelines are loaded on each node and any subprocess uses at most 13 cores. Hence, a maximum of 26
cores is used at any one moment on each node in both solutions. In addition to the CPU constraint, an equal amount of memory is
allocated for every process.
The pipeline tested here is not optimized for any particular system, and default values are used wherever possible. A total of 80
BWA-GATK pipeline processes are run simultaneously in each test, with results written to the Lustre storage.
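The arithmetic behind this layout can be summarized as follows. The equal memory split per pipeline is an assumption for illustration; the exact memory allocation per process is not specified in this section.

```python
# Sketch of the per-node resource split used in this test (assumed values noted).
NODES = 40                   # 40 nodes in each solution
PIPELINES_PER_NODE = 2       # two BWA-GATK pipelines loaded per node
THREADS_PER_SUBPROCESS = 13  # any subprocess uses at most 13 cores
NODE_MEMORY_GB = 128

total_pipelines = NODES * PIPELINES_PER_NODE                       # 80 simultaneous runs
busy_cores_per_node = PIPELINES_PER_NODE * THREADS_PER_SUBPROCESS  # 26 cores busy per node
memory_per_pipeline_gb = NODE_MEMORY_GB // PIPELINES_PER_NODE      # assumed equal split

print(f"{total_pipelines} pipelines, {busy_cores_per_node} busy cores/node, "
      f"~{memory_per_pipeline_gb} GB memory/pipeline")
```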