Versatile Entry-Level Inference
The NVIDIA A2 Tensor Core GPU provides entry-level inference with
low power, a small footprint, and high performance for NVIDIA AI at
the edge. Featuring a low-profile PCIe Gen4 card and a low 40-60
watt (W) configurable thermal design power (TDP) capability, the A2
brings adaptable inference acceleration to any server.
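The configurable TDP can be illustrated with a small sketch: a hypothetical helper (not an NVIDIA API) that clamps a requested power limit to the A2's documented 40-60W window. On a real system the limit would be applied through vendor tooling such as nvidia-smi's power-limit option.

```python
# Hypothetical helper (not an NVIDIA API): clamp a requested power limit
# to the A2's documented configurable TDP window of 40-60 W.
A2_TDP_MIN_W = 40
A2_TDP_MAX_W = 60

def clamp_tdp(requested_watts: float) -> float:
    """Return the nearest power limit inside the A2's 40-60 W TDP range."""
    return max(A2_TDP_MIN_W, min(A2_TDP_MAX_W, requested_watts))

if __name__ == "__main__":
    for req in (35, 52, 75):
        print(f"requested {req} W -> applied {clamp_tdp(req)} W")
```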
A2's versatility, compact size, and low power exceed the demands
for edge deployments at scale, instantly upgrading existing entry-
level CPU servers to handle inference. Servers accelerated with A2
GPUs deliver higher inference performance versus CPUs and more
efficient intelligent video analytics (IVA) deployments than previous
GPU generations—all at an entry-level price point.
NVIDIA-Certified Systems™ featuring A2 GPUs and NVIDIA AI,
including the NVIDIA Triton™ Inference Server, deliver breakthrough
inference performance across edge, data center, and cloud.
They ensure that AI-enabled applications deploy with fewer servers
and less power, resulting in easier deployments, faster insights, and
significantly lower costs.
Up to 20X More Inference Performance
AI inference is deployed to make consumers' lives more convenient
through real-time experiences and to extract insights from trillions
of end-point sensors and cameras. Compared to CPU-only servers,
servers built with the NVIDIA A2 Tensor Core GPU offer up
to 20X more inference performance, instantly upgrading any server to
handle modern AI.
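The "fewer servers" claim is straightforward consolidation math. A sketch with illustrative numbers (the throughput figures below are assumptions, not NVIDIA benchmarks; only the up-to-20X speedup comes from this datasheet):

```python
# Illustrative consolidation math (assumed throughput numbers, not
# NVIDIA benchmarks): if one A2-accelerated server delivers N times the
# inference throughput of a CPU-only server, fewer servers are needed
# for the same workload.
import math

def servers_needed(target_qps: float, per_server_qps: float) -> int:
    """Number of servers required to sustain target_qps."""
    return math.ceil(target_qps / per_server_qps)

# Example: a workload of 10,000 inferences/s, an assumed CPU server rate
# of 50/s, and an A2 server at the datasheet's up-to-20X speedup.
cpu_servers = servers_needed(10_000, 50)       # CPU-only servers
a2_servers = servers_needed(10_000, 20 * 50)   # A2-accelerated servers
print(cpu_servers, a2_servers)
```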
DATASHEET
NVIDIA A2 TENSOR CORE GPU
Entry-level GPU that brings NVIDIA AI to any server.
SYSTEM SPECIFICATIONS
Peak FP32: 4.5 TF
TF32 Tensor Core: 9 TF | 18 TF¹
BFLOAT16 Tensor Core: 18 TF | 36 TF¹
Peak FP16 Tensor Core: 18 TF | 36 TF¹
Peak INT8 Tensor Core: 36 TOPS | 72 TOPS¹
Peak INT4 Tensor Core: 72 TOPS | 144 TOPS¹
RT Cores: 10
Media engines: 1 video encoder, 2 video decoders (includes AV1 decode)
GPU memory: 16GB GDDR6
GPU memory bandwidth: 200GB/s
Interconnect: PCIe Gen4 x8
Form factor: 1-slot, low-profile PCIe
Max thermal design power (TDP): 40-60W (configurable)
vGPU software support²: NVIDIA Virtual PC (vPC), NVIDIA Virtual Applications (vApps), NVIDIA RTX Virtual Workstation (vWS), NVIDIA AI Enterprise, NVIDIA Virtual Compute Server (vCS)
¹ With sparsity
² Supported in future vGPU release
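The footnoted figures in the table follow a simple pattern: each Tensor Core peak rate doubles when structured sparsity is exploited. A small sketch of that relationship, using the dense values from the table above:

```python
# Dense peak rates from the A2 specification table
# (TF for floating-point formats, TOPS for integer formats).
DENSE_PEAKS = {
    "TF32 Tensor Core": 9,
    "BFLOAT16 Tensor Core": 18,
    "FP16 Tensor Core": 18,
    "INT8 Tensor Core": 36,
    "INT4 Tensor Core": 72,
}

def with_sparsity(dense_rate: float) -> float:
    """Structured sparsity doubles the peak Tensor Core rate (footnote 1)."""
    return 2 * dense_rate

for name, dense in DENSE_PEAKS.items():
    print(f"{name}: {dense} dense | {with_sparsity(dense)} with sparsity")
```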
[Chart: Inference Speedup, NVIDIA A2 versus CPU]
Computer Vision (EfficientDet-D0): up to 8X
Natural Language Processing (BERT-Large): up to 7X
Text-to-Speech (Tacotron2 + Waveglow): up to 20X
Comparisons of one NVIDIA A2 Tensor Core GPU versus a dual-socket Xeon Gold 6330N CPU.
System configuration: CPU: HPE DL380 Gen10 Plus, 2S Xeon Gold 6330N @ 2.2GHz, 512GB DDR4 | Computer Vision: EfficientDet-D0 (COCO, 512x512), TensorRT 8.2, Precision: INT8, BS=8 (GPU); OpenVINO 2021.4, Precision: INT8, BS=8 (CPU) | NLP: BERT-Large (sequence length 384, SQuAD v1.1), TensorRT 8.2, Precision: INT8, BS=1 (GPU); OpenVINO 2021.4, Precision: INT8, BS=1 (CPU) | Text-to-Speech: Tacotron2 + Waveglow end-to-end pipeline (input length 128), PyTorch 1.9, Precision: FP16, BS=1 (GPU); PyTorch 1.9, Precision: FP32, BS=1 (CPU).

A2 Improves Performance by Up to 1.3X Versus T4
[Chart: IVA Performance (Normalized), relative performance on 1080p30 video streams]
ShuffleNet v2: NVIDIA T4 1.0X, NVIDIA A2 1.3X | MobileNet v2: NVIDIA T4 1.0X, NVIDIA A2 1.2X
System configuration: Supermicro SYS-1029GQ-TRT, 2S Xeon Gold 6240 @ 2.6GHz, 512GB DDR4, 1x NVIDIA A2 or 1x NVIDIA T4 | Measured performance with DeepStream 5.1. Networks: ShuffleNet-v2 (224x224), MobileNet-v2 (224x224) | Pipeline represents end-to-end performance with video capture and decode, pre-processing, batching, inference, and post-processing.

Lower Power and Configurable TDP
[Chart: TDP operating range (watts), 40-75W axis]
A2 reduces power consumption by up to 40% versus T4: the A2's configurable 40-60W TDP range sits below the T4's 70W TDP.
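The power advantage compounds at fleet scale. A back-of-the-envelope sketch, using assumed values (a hypothetical 100-card fleet running around the clock, with the A2 at its 60W maximum TDP against the T4's 70W TDP; actual draw under workload will differ from TDP):

```python
# Back-of-the-envelope energy comparison (illustrative assumptions:
# 100-card fleet running 24/7; A2 at its 60 W maximum TDP, T4 at 70 W).
HOURS_PER_YEAR = 24 * 365

def fleet_kwh_per_year(cards: int, watts_per_card: float) -> float:
    """Annual energy draw of a GPU fleet in kilowatt-hours."""
    return cards * watts_per_card * HOURS_PER_YEAR / 1000

a2_kwh = fleet_kwh_per_year(100, 60)
t4_kwh = fleet_kwh_per_year(100, 70)
print(f"A2: {a2_kwh:.0f} kWh/yr, T4: {t4_kwh:.0f} kWh/yr, "
      f"saved: {t4_kwh - a2_kwh:.0f} kWh/yr")
```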