NVIDIA A2 Tensor Core GPU Datasheet

Versatile Entry-Level Inference
The NVIDIA A2 Tensor Core GPU provides entry-level inference with
low power, a small footprint, and high performance for NVIDIA AI at
the edge. Featuring a low-profile PCIe Gen4 card and a configurable
thermal design power (TDP) of just 40-60 watts (W), the A2 brings
adaptable inference acceleration to any server.
The A2's versatility, compact size, and low power meet the demands
of edge deployments at scale, instantly upgrading existing entry-level
CPU servers to handle inference. Servers accelerated with A2 GPUs
deliver higher inference performance than CPUs and more efficient
intelligent video analytics (IVA) deployments than previous GPU
generations, all at an entry-level price point.
NVIDIA-Certified Systems featuring A2 GPUs and NVIDIA AI, including
the NVIDIA Triton Inference Server, deliver breakthrough inference
performance across edge, data center, and cloud. They ensure that
AI-enabled applications deploy with fewer servers and less power,
resulting in easier deployments, faster insights, and significantly
lower costs.
Up to 20X More Inference Performance
AI inference is deployed to make consumers' lives more convenient
through real-time experiences and to extract insights from trillions
of end-point sensors and cameras. Compared to CPU-only servers,
servers built with the NVIDIA A2 Tensor Core GPU offer up to 20X
more inference performance, instantly upgrading any server to handle
modern AI.
Entry-level GPU that brings NVIDIA AI to any server.
SYSTEM SPECIFICATIONS
Peak FP32: 4.5 TF
TF32 Tensor Core: 9 TF | 18 TF¹
BFLOAT16 Tensor Core: 18 TF | 36 TF¹
Peak FP16 Tensor Core: 18 TF | 36 TF¹
Peak INT8 Tensor Core: 36 TOPS | 72 TOPS¹
Peak INT4 Tensor Core: 72 TOPS | 144 TOPS¹
RT Cores: 10
Media engines: 1 video encoder, 2 video decoders (includes AV1 decode)
GPU memory: 16GB GDDR6
GPU memory bandwidth: 200GB/s
Interconnect: PCIe Gen4 x8
Form factor: 1-slot, low-profile PCIe
Max thermal design power (TDP): 40-60W (configurable)
vGPU software support: NVIDIA Virtual PC (vPC), NVIDIA Virtual Applications (vApps), NVIDIA RTX Virtual Workstation (vWS), NVIDIA AI Enterprise, NVIDIA Virtual Compute Server (vCS)
¹ With sparsity
² Supported in future vGPU release
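The Tensor Core rows in the table above follow a regular scaling pattern: each halving of numeric precision doubles peak throughput, and structured sparsity doubles it again. A minimal sketch in plain Python, using only the table's published numbers, makes the pattern explicit (the dictionary name and helper are illustrative, not part of any NVIDIA API):

```python
# Peak Tensor Core throughput from the A2 spec table above.
# Values are (dense, with-sparsity); TF32/BF16/FP16 rows are TFLOPS,
# INT8/INT4 rows are TOPS.
A2_TENSOR_THROUGHPUT = {
    "TF32":     (9, 18),
    "BFLOAT16": (18, 36),
    "FP16":     (18, 36),
    "INT8":     (36, 72),
    "INT4":     (72, 144),
}

def sparsity_speedup(precision: str) -> float:
    """Ratio of sparse to dense peak throughput for a given precision."""
    dense, sparse = A2_TENSOR_THROUGHPUT[precision]
    return sparse / dense

# Structured sparsity doubles peak throughput at every precision...
assert all(sparsity_speedup(p) == 2.0 for p in A2_TENSOR_THROUGHPUT)
# ...and each halving of integer precision doubles peak TOPS.
assert A2_TENSOR_THROUGHPUT["INT4"][0] == 2 * A2_TENSOR_THROUGHPUT["INT8"][0]
```

This is why the INT8 benchmarks cited below reach the largest speedups over CPU baselines: lower precision directly buys peak throughput.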
System onguraton PU HPE DL380 en10 Plus, 2S Xeon old 6330N
22Hz, 512B DDR4 | omputer Vson EfcentDet-D0 (OO, 512x512) |
TensorRT 82, Precson INT8, BS8 (PU) | OpenVINO 20214, Precson INT8,
BS8 (PU)
6X 10X
8X
1X
8X2X 4X
Computer Vision (EfficientDet-DO)
System onguratonPUHPE DL380 en10 Plus, 2S Xeon old 6330N
22Hz, 512B DDR4 | NLP BERT-Large (Sequence length384, SQuAD
v11) | TensorRT 82, PrecsonINT8, BS1 (PU) | OpenVINO 20214,
PrecsonINT8, BS1 (PU)
8X
Natural Language Processing (BERT-Large)
System onguratonPUHPE DL380 en10 Plus, 2S Xeon old 6330N
22Hz, 512B DDR4 | Text-to-SpeechTacotron2 + Waveglow end-to-end
ppelne (nput length 128) | PyTorch 19, PrecsonFP16, BS1 (PU) | PyTorch
19, PrecsonFP32, BS1 (PU)
15X 20X 25X
20X
1X
5X 10X
Text-to-Speech (Tacotron2 + Waveglow)
MobileNet v2
0.0x
0.5x
1.0x
1.5x
Relative Performance (Video Streams 1080p30)
1.0X
1.2X
1.0X
1.3X
NVIDIA T4
ShuffleNet v2
NVIDIA A2
SystemConfiguration: [Supermicro SYS-1029GQ-TRT, 2S Xeon Gold 6240 2.6GHz,
512GB DDR4, 1x NVIDIA A2 OR 1x NVIDIA T4] | Measured performance with
Deepstream 5.1. Networks: ShuffleNet-v2 (224x224), MobileNet-v2 (224x224) |
Pipeline represents end-to-end performance with video capture and decode,
pre-processing, batching, inference, and post-processing.
A2 Improves Performance by Up to 1.3X Versus T4
IVA Performance (Normalized)
NVIDIA A2
40 65 70 75
TDP Operatng Range (Watts)
A2 Reduces Power Consumption by Up to
40% Versus T4
Lower Power and Configurable TDP
55 6045 50
NVIDIA T4
6X
7X
1X
2X 4X
Inference Speedup
omparsons of one NVIDIA A2 Tensor ore PU versus a
dual-socket Xeon old 6330N PU
0X
NVIDIA A2
PU
Inference Speedup
omparsons of one NVIDIA A2 Tensor ore PU versus a
dual-socket Xeon old 6330N PU
0X
NVIDIA A2
PU
Inference Speedup
omparsons of one NVIDIA A2 Tensor ore PU versus a
dual-socket Xeon old 6330N PU
0X
NVIDIA A2
PU
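As an illustrative back-of-the-envelope, the power and IVA figures above can be combined into a perf-per-watt comparison. A sketch, assuming the chart's endpoints (A2 configurable at 40-60 W from the spec table, T4 near 70 W); the arithmetic is for illustration only, not an additional published benchmark:

```python
# Illustrative perf-per-watt sketch using figures from this datasheet.
# Assumed TDPs: A2 configurable 40-60 W (spec table), T4 ~70 W (chart).
A2_TDP_RANGE_W = (40, 60)
T4_TDP_W = 70

# Normalized IVA throughput versus T4 (DeepStream 5.1 chart above).
RELATIVE_IVA_PERF = {"ShuffleNet v2": 1.2, "MobileNet v2": 1.3}

def power_reduction(a2_watts: float, t4_watts: float = T4_TDP_W) -> float:
    """Fractional power reduction of an A2 at a given TDP versus T4."""
    return 1 - a2_watts / t4_watts

# At its 40 W floor the A2 draws roughly 40% less power than a T4...
low_tdp, high_tdp = A2_TDP_RANGE_W
assert round(power_reduction(low_tdp), 2) == 0.43
# ...while at full TDP it still delivers up to 1.3X the IVA throughput.
assert max(RELATIVE_IVA_PERF.values()) == 1.3
```

The configurable TDP is the design point here: the same card can be capped low for power-constrained edge enclosures or run at 60 W for maximum throughput.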
System onguratonPUHPE DL380 en10 Plus, 2S Xeon old 6330N
22Hz, 512B DDR4 | omputer VsonEfcentDet-D0 (OO, 512x512) |
TensorRT 82, PrecsonINT8, BS8 (PU) | OpenVINO 20214, PrecsonINT8,
BS8 (PU)
6X 10X
8X
1X
8X2X 4X
Computer Vision (EfficientDet-DO)
System onguraton PU HPE DL380 en10 Plus, 2S Xeon old 6330N
22Hz, 512B DDR4 | NLP BERT-Large (Sequence length 384, SQuAD
v11) | TensorRT 82, Precson INT8, BS1 (PU) | OpenVINO 20214,
Precson INT8, BS1 (PU)
8X
Natural Language Processing (BERT-Large)
System onguratonPUHPE DL380 en10 Plus, 2S Xeon old 6330N
22Hz, 512B DDR4 | Text-to-SpeechTacotron2 + Waveglow end-to-end
ppelne (nput length 128) | PyTorch 19, PrecsonFP16, BS1 (PU) | PyTorch
19, PrecsonFP32, BS1 (PU)
15X 20X 25X
20X
1X
5X 10X
Text-to-Speech (Tacotron2 + Waveglow)
MobileNet v2
0.0x
0.5x
1.0x
1.5x
Relative Performance (Video Streams 1080p30)
1.0X
1.2X
1.0X
1.3X
NVIDIA T4
ShuffleNet v2
NVIDIA A2
SystemConfiguration: [Supermicro SYS-1029GQ-TRT, 2S Xeon Gold 6240 2.6GHz,
512GB DDR4, 1x NVIDIA A2 OR 1x NVIDIA T4] | Measured performance with
Deepstream 5.1. Networks: ShuffleNet-v2 (224x224), MobileNet-v2 (224x224) |
Pipeline represents end-to-end performance with video capture and decode,
pre-processing, batching, inference, and post-processing.
A2 Improves Performance by Up to 1.3X Versus T4
IVA Performance (Normalized)
NVIDIA A2
40 65 70 75
TDP Operatng Range (Watts)
A2 Reduces Power Consumption by Up to
40% Versus T4
Lower Power and Configurable TDP
55 6045 50
NVIDIA T4
6X
7X
1X
2X 4X
Inference Speedup
omparsons of one NVIDIA A2 Tensor ore PU versus a
dual-socket Xeon old 6330N PU
0X
NVIDIA A2
PU
Inference Speedup
omparsons of one NVIDIA A2 Tensor ore PU versus a
dual-socket Xeon old 6330N PU
0X
NVIDIA A2
PU
Inference Speedup
omparsons of one NVIDIA A2 Tensor ore PU versus a
dual-socket Xeon old 6330N PU
0X
NVIDIA A2
PU
System onguratonPUHPE DL380 en10 Plus, 2S Xeon old 6330N
22Hz, 512B DDR4 | omputer VsonEfcentDet-D0 (OO, 512x512) |
TensorRT 82, PrecsonINT8, BS8 (PU) | OpenVINO 20214, PrecsonINT8,
BS8 (PU)
6X 10X
8X
1X
8X2X 4X
Computer Vision (EfficientDet-DO)
System onguratonPUHPE DL380 en10 Plus, 2S Xeon old 6330N
22Hz, 512B DDR4 | NLP BERT-Large (Sequence length384, SQuAD
v11) | TensorRT 82, PrecsonINT8, BS1 (PU) | OpenVINO 20214,
PrecsonINT8, BS1 (PU)
8X
Natural Language Processing (BERT-Large)
System onguraton PU HPE DL380 en10 Plus, 2S Xeon old 6330N
22Hz, 512B DDR4 | Text-to-Speech Tacotron2 + Waveglow end-to-end
ppelne (nput length 128) | PyTorch 19, Precson FP16, BS1 (PU) | PyTorch
19, Precson FP32, BS1 (PU)
15X 20X 25X
20X
1X
5X 10X
Text-to-Speech (Tacotron2 + Waveglow)
MobileNet v2
0.0x
0.5x
1.0x
1.5x
Relative Performance (Video Streams 1080p30)
1.0X
1.2X
1.0X
1.3X
NVIDIA T4
ShuffleNet v2
NVIDIA A2
SystemConfiguration: [Supermicro SYS-1029GQ-TRT, 2S Xeon Gold 6240 2.6GHz,
512GB DDR4, 1x NVIDIA A2 OR 1x NVIDIA T4] | Measured performance with
Deepstream 5.1. Networks: ShuffleNet-v2 (224x224), MobileNet-v2 (224x224) |
Pipeline represents end-to-end performance with video capture and decode,
pre-processing, batching, inference, and post-processing.
A2 Improves Performance by Up to 1.3X Versus T4
IVA Performance (Normalized)
NVIDIA A2
40 65 70 75
TDP Operatng Range (Watts)
A2 Reduces Power Consumption by Up to
40% Versus T4
Lower Power and Configurable TDP
55 6045 50
NVIDIA T4
6X
7X
1X
2X 4X
Inference Speedup
omparsons of one NVIDIA A2 Tensor ore PU versus a
dual-socket Xeon old 6330N PU
0X
NVIDIA A2
PU
Inference Speedup
omparsons of one NVIDIA A2 Tensor ore PU versus a
dual-socket Xeon old 6330N PU
0X
NVIDIA A2
PU
Inference Speedup
omparsons of one NVIDIA A2 Tensor ore PU versus a
dual-socket Xeon old 6330N PU
0X
NVIDIA A2
PU
