

This article provides details on the NVIDIA A-series GPUs (codenamed “Ampere”). “Ampere” GPUs improve upon the previous-generation “Volta” and “Turing” architectures. Ampere A100 GPUs began shipping in May 2020 (with other variants shipping by the end of 2020).

Note that not all “Ampere” generation GPUs provide the same capabilities and feature sets. Broadly speaking, there is one version dedicated solely to computation and a second version dedicated to a mixture of graphics/visualization and compute. The specifications of both versions are shown below – speak with one of our GPU experts for a personalized summary of the options best suited to your needs.

Computational “Ampere” GPU architecture – important features and changes:

- 9.7 TFLOPS FP64 double-precision floating-point performance
- Up to 19.5 TFLOPS FP64 double-precision via Tensor Core FP64 instruction support
- 19.5 TFLOPS FP32 single-precision floating-point performance
- Exceptional AI deep learning training and inference performance:
  - TensorFloat-32 (TF32) instructions improve performance without loss of accuracy (see the sketch after this list)
  - Sparse matrix optimizations potentially double training and inference performance
  - Speedups of 3x~20x for network training with sparse TF32 Tensor Cores (vs Tesla V100)
  - Speedups of 7x~20x for inference with sparse INT8 Tensor Cores (vs Tesla V100)
  - Tensor Cores support many instruction types: FP64, TF32, BF16, FP16, INT8, INT4, and binary (B1)
- High-speed HBM2 Memory delivers 40GB or 80GB capacity at 1.6TB/s or 2TB/s throughput
- Multi-Instance GPU (MIG) allows each A100 GPU to run up to seven separate/isolated applications (see the partitioning example after this list)
- 3rd-generation NVLink doubles transfer speeds between GPUs
- 4th-generation PCI-Express doubles transfer speeds between the system and each GPU
- Native ECC Memory detects and corrects memory errors without any capacity or performance overhead
- Larger and Faster L1 Cache and Shared Memory for improved performance
- Improved L2 Cache is twice as fast and nearly seven times as large as the L2 on Tesla V100
- Compute Data Compression accelerates compressible data patterns, resulting in up to 4x faster DRAM bandwidth, up to 4x faster L2 read bandwidth, and up to a 2x increase in L2 capacity
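To make the TF32 bullet above concrete, here is a minimal sketch of opting into TF32 math from PyTorch. The framework choice, the PyTorch 1.7+ version assumption, and the flag names are our illustration and are not from the original article:

```python
# Minimal sketch: enabling TF32 math in PyTorch on an Ampere GPU.
# Assumes PyTorch 1.7+ with CUDA and an A100 (or other Ampere) device.
import torch

# TF32 keeps FP32's 8-bit exponent (same numeric range) but rounds the
# mantissa to 10 bits, letting matmuls/convolutions run on Tensor Cores.
torch.backends.cuda.matmul.allow_tf32 = True   # matrix multiplications
torch.backends.cudnn.allow_tf32 = True         # cuDNN convolutions

a = torch.randn(4096, 4096, device="cuda")
b = torch.randn(4096, 4096, device="cuda")
c = a @ b   # eligible for TF32 Tensor Core kernels on Ampere
```

Because TF32 preserves FP32’s dynamic range, most networks can adopt it with no changes beyond these flags, which is what makes it a near drop-in speedup.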

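The Multi-Instance GPU bullet can likewise be exercised from the command line. The sketch below drives the nvidia-smi CLI from Python to partition GPU 0 into seven of the smallest instances; it assumes root privileges, a MIG-capable driver, and a 40GB A100, and the 1g.5gb profile ID of 19 should be confirmed with `nvidia-smi mig -lgip` on your own system:

```python
# Minimal sketch: splitting an A100 into seven isolated MIG instances by
# invoking the nvidia-smi CLI from Python. The commands and the profile ID
# below are assumptions to verify against NVIDIA's MIG documentation.
import subprocess

def run(cmd: str) -> None:
    """Run a shell command and echo its output (requires root)."""
    result = subprocess.run(cmd, shell=True, capture_output=True, text=True)
    print(result.stdout or result.stderr)

run("nvidia-smi -i 0 -mig 1")    # enable MIG mode on GPU 0 (may require reset)
run("nvidia-smi mig -lgip")      # list the available GPU instance profiles
# Create seven 1g.5gb GPU instances (profile 19 on the 40GB A100) and
# matching compute instances (-C) so each can run a separate application.
run("nvidia-smi mig -i 0 -cgi 19,19,19,19,19,19,19 -C")
run("nvidia-smi -L")             # each MIG device is now listed individually
```

Each resulting MIG device receives its own slice of memory, cache, and compute, which is what gives the seven applications their isolation from one another.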
Visualization “Ampere” GPU architecture – important features and changes:

- Double FP32 processing throughput with upgraded Streaming Multiprocessors (SMs) that support FP32 computation on both datapaths (previous generations provided one dedicated FP32 path and one dedicated integer path)
- 2nd-generation RT Cores provide up to a 2x increase in ray tracing performance
- 3rd-generation Tensor Cores with TF32 and support for sparsity optimizations
- 3rd-generation NVLink provides up to 56.25 GB/sec bandwidth between pairs of GPUs in each direction
- GDDR6X memory providing up to 768 GB/s of GPU memory throughput

As stated above, the feature sets vary between the “computational” and the “visualization” GPU models. Additional details on each are shared in the tabs below, and the best choice will depend upon your mix of workloads. Please contact our team for additional review and discussion.

The table below summarizes the features of the NVIDIA Ampere GPU accelerators designed for computation and deep learning/AI/ML. Note that the PCI-Express version of the NVIDIA A100 GPU features a much lower TDP than the SXM4 version (250W vs 400W). For this reason, the PCI-Express GPU is not able to sustain peak performance in the same way as the higher-power part. Thus, the performance values of the PCI-E A100 GPU are shown as a range, and actual performance will vary by workload.

[Table: NVIDIA Ampere computational GPU specifications]
* theoretical peak performance based on GPU boost clock (see the worked example below)
† an additional 2X performance can be achieved via NVIDIA’s new sparsity feature
Host-to-GPU and GPU-to-GPU transfer bandwidths are quoted bidirectionally; NVLink is limited to pairs of directly-linked cards.

Visualization & Ray Tracing GPUs

To learn more about these products, or to find out how best to leverage their capabilities, please speak with an HPC/AI expert.
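For readers who want to sanity-check the peak-performance figures quoted above, the short calculation below reconstructs them from the A100’s published SM count and boost clock, per the table footnote on boost clocks. The per-SM core counts come from NVIDIA’s Ampere documentation; treat this as an illustration rather than official math:

```python
# Minimal sketch: deriving the A100's theoretical peak throughput from its
# boost clock. Published specs assumed here: 108 SMs at a 1410 MHz boost.
SMS = 108
BOOST_GHZ = 1.41
FP32_CORES_PER_SM = 64   # FP32 CUDA cores per SM
FP64_CORES_PER_SM = 32   # FP64 units per SM

# Each core can retire one fused multiply-add (2 FLOPs) per clock cycle.
peak_fp32 = SMS * FP32_CORES_PER_SM * 2 * BOOST_GHZ / 1000   # in TFLOPS
peak_fp64 = SMS * FP64_CORES_PER_SM * 2 * BOOST_GHZ / 1000   # in TFLOPS

print(f"FP32 peak: {peak_fp32:.1f} TFLOPS")  # ~19.5, as listed above
print(f"FP64 peak: {peak_fp64:.1f} TFLOPS")  # ~9.7; Tensor Core FP64 doubles it
```

The same style of reasoning explains the † footnote: structured sparsity lets the Tensor Cores skip half of the operands, which is where the advertised additional 2X comes from.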
