Versatile Entry-Level Inference
The NVIDIA A2 Tensor Core GPU provides entry-level inference with
low power, a small footprint, and high performance for NVIDIA AI at
the edge. Featuring a low-profile PCIe Gen4 card and a low 40-60
watt (W) configurable thermal design power (TDP) capability, the A2
brings adaptable inference acceleration to any server.
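The configurable TDP can be illustrated with a small sketch: a hypothetical helper (not an NVIDIA API) that clamps a requested power limit to the A2's documented 40-60W window. On a real system the limit would be applied through vendor tooling such as nvidia-smi's power-limit option.

```python
# Hypothetical helper (not an NVIDIA API): clamp a requested power limit
# to the A2's documented configurable TDP window of 40-60 W.
A2_TDP_MIN_W = 40
A2_TDP_MAX_W = 60

def clamp_tdp(requested_watts: float) -> float:
    """Return the nearest power limit inside the A2's 40-60 W TDP range."""
    return max(A2_TDP_MIN_W, min(A2_TDP_MAX_W, requested_watts))

if __name__ == "__main__":
    for req in (35, 52, 75):
        print(f"requested {req} W -> applied {clamp_tdp(req)} W")
```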
A2's versatility, compact size, and low power exceed the demands
for edge deployments at scale, instantly upgrading existing entry-
level CPU servers to handle inference. Servers accelerated with A2
GPUs deliver higher inference performance versus CPUs and more
efficient intelligent video analytics (IVA) deployments than previous
GPU generations—all at an entry-level price point.
NVIDIA-Certified Systems™ featuring A2 GPUs and NVIDIA AI,
including the NVIDIA Triton™ Inference Server, deliver breakthrough
inference performance across edge, data center, and cloud.
They ensure that AI-enabled applications deploy with fewer servers
and less power, resulting in easier deployments, faster insights, and
significantly lower costs.
Up to 20X More Inference Performance
AI inference is deployed to make consumers' lives more convenient
through real-time experiences and to extract insights from trillions
of end-point sensors and cameras. Compared to CPU-only servers,
servers built with the NVIDIA A2 Tensor Core GPU offer up
to 20X more inference performance, instantly upgrading any server to
handle modern AI.
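The "fewer servers" claim is straightforward consolidation math. A sketch with illustrative numbers (the throughput figures below are assumptions, not NVIDIA benchmarks; only the up-to-20X speedup comes from this datasheet):

```python
# Illustrative consolidation math (assumed throughput numbers, not
# NVIDIA benchmarks): if one A2-accelerated server delivers N times the
# inference throughput of a CPU-only server, fewer servers are needed
# for the same workload.
import math

def servers_needed(target_qps: float, per_server_qps: float) -> int:
    """Number of servers required to sustain target_qps."""
    return math.ceil(target_qps / per_server_qps)

# Example: a workload of 10,000 inferences/s, an assumed CPU server rate
# of 50/s, and an A2 server at the datasheet's up-to-20X speedup.
cpu_servers = servers_needed(10_000, 50)       # CPU-only servers
a2_servers = servers_needed(10_000, 20 * 50)   # A2-accelerated servers
print(cpu_servers, a2_servers)
```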
DATASHEET
NVIDIA A2 TENSOR CORE GPU
Entry-level GPU that brings NVIDIA AI to any server.
SYSTEM SPECIFICATIONS
Peak FP32: 4.5 TF
TF32 Tensor Core: 9 TF | 18 TF¹
BFLOAT16 Tensor Core: 18 TF | 36 TF¹
Peak FP16 Tensor Core: 18 TF | 36 TF¹
Peak INT8 Tensor Core: 36 TOPS | 72 TOPS¹
Peak INT4 Tensor Core: 72 TOPS | 144 TOPS¹
RT Cores: 10
Media engines: 1 video encoder, 2 video decoders (includes AV1 decode)
GPU memory: 16GB GDDR6
GPU memory bandwidth: 200GB/s
Interconnect: PCIe Gen4 x8
Form factor: 1-slot, low-profile PCIe
Max thermal design power (TDP): 40-60W (configurable)
vGPU software support²: NVIDIA Virtual PC (vPC), NVIDIA Virtual Applications (vApps), NVIDIA RTX Virtual Workstation (vWS), NVIDIA AI Enterprise, NVIDIA Virtual Compute Server (vCS)
¹ With sparsity
² Supported in future vGPU release
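The footnoted figures in the table follow a simple pattern: each Tensor Core peak rate doubles when structured sparsity is exploited. A small sketch of that relationship, using the dense values from the table above:

```python
# Dense peak rates from the A2 specification table
# (TF for floating-point formats, TOPS for integer formats).
DENSE_PEAKS = {
    "TF32 Tensor Core": 9,
    "BFLOAT16 Tensor Core": 18,
    "FP16 Tensor Core": 18,
    "INT8 Tensor Core": 36,
    "INT4 Tensor Core": 72,
}

def with_sparsity(dense_rate: float) -> float:
    """Structured sparsity doubles the peak Tensor Core rate (footnote 1)."""
    return 2 * dense_rate

for name, dense in DENSE_PEAKS.items():
    print(f"{name}: {dense} dense | {with_sparsity(dense)} with sparsity")
```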
[Chart: Inference Speedup, NVIDIA A2 versus CPU]
Computer Vision (EfficientDet-D0): up to 8X
Natural Language Processing (BERT-Large): up to 7X
Text-to-Speech (Tacotron2 + Waveglow): up to 20X
Comparisons of one NVIDIA A2 Tensor Core GPU versus a dual-socket Xeon Gold 6330N CPU.
System configuration: CPU: HPE DL380 Gen10 Plus, 2S Xeon Gold 6330N @ 2.2GHz, 512GB DDR4 | Computer Vision: EfficientDet-D0 (COCO, 512x512), TensorRT 8.2, Precision: INT8, BS=8 (GPU); OpenVINO 2021.4, Precision: INT8, BS=8 (CPU) | NLP: BERT-Large (sequence length 384, SQuAD v1.1), TensorRT 8.2, Precision: INT8, BS=1 (GPU); OpenVINO 2021.4, Precision: INT8, BS=1 (CPU) | Text-to-Speech: Tacotron2 + Waveglow end-to-end pipeline (input length 128), PyTorch 1.9, Precision: FP16, BS=1 (GPU); PyTorch 1.9, Precision: FP32, BS=1 (CPU).

A2 Improves Performance by Up to 1.3X Versus T4
[Chart: IVA Performance (Normalized), relative performance on 1080p30 video streams]
ShuffleNet v2: NVIDIA T4 1.0X, NVIDIA A2 1.3X | MobileNet v2: NVIDIA T4 1.0X, NVIDIA A2 1.2X
System configuration: Supermicro SYS-1029GQ-TRT, 2S Xeon Gold 6240 @ 2.6GHz, 512GB DDR4, 1x NVIDIA A2 or 1x NVIDIA T4 | Measured performance with DeepStream 5.1. Networks: ShuffleNet-v2 (224x224), MobileNet-v2 (224x224) | Pipeline represents end-to-end performance with video capture and decode, pre-processing, batching, inference, and post-processing.

Lower Power and Configurable TDP
[Chart: TDP operating range (watts), 40-75W axis]
A2 reduces power consumption by up to 40% versus T4: the A2's configurable 40-60W TDP range sits below the T4's 70W TDP.
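The power advantage compounds at fleet scale. A back-of-the-envelope sketch, using assumed values (a hypothetical 100-card fleet running around the clock, with the A2 at its 60W maximum TDP against the T4's 70W TDP; actual draw under workload will differ from TDP):

```python
# Back-of-the-envelope energy comparison (illustrative assumptions:
# 100-card fleet running 24/7; A2 at its 60 W maximum TDP, T4 at 70 W).
HOURS_PER_YEAR = 24 * 365

def fleet_kwh_per_year(cards: int, watts_per_card: float) -> float:
    """Annual energy draw of a GPU fleet in kilowatt-hours."""
    return cards * watts_per_card * HOURS_PER_YEAR / 1000

a2_kwh = fleet_kwh_per_year(100, 60)
t4_kwh = fleet_kwh_per_year(100, 70)
print(f"A2: {a2_kwh:.0f} kWh/yr, T4: {t4_kwh:.0f} kWh/yr, "
      f"saved: {t4_kwh - a2_kwh:.0f} kWh/yr")
```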