Recently, NVIDIA officially unveiled its new-generation GPU architecture, "Ampere". Its enormous scale and intricate design are impressive, and, not surprisingly, the first core ships cut down as always. The first thing to note is that unlike the Pascal-based Tesla P100 and the Volta-based Tesla V100, the new compute card is simply called the "A100" and carries no Tesla branding. The reason is unstated; it may be intended to signal a wider range of applications.
At the same time, the new product is called the "A100 Tensor Core GPU", highlighting the key role of its Tensor cores, while the chip itself, following NVIDIA's naming tradition, is codenamed GA100.
The GA100 is organized into eight GPCs (GPU Processing Clusters); each GPC contains eight TPCs (Texture Processing Clusters), each TPC contains two SMs (Streaming Multiprocessors), and each SM contains 64 FP32 CUDA cores (stream processors).
As a result, a complete GA100 chip has a total of 128 SMs and 8192 stream processors, a hierarchy consistent with NVIDIA's previous GPU architectures.
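The hierarchy above is easy to sanity-check with a little arithmetic. A minimal sketch (the variable names are just illustrative):

```python
# Full GA100 die: 8 GPCs, each with 8 TPCs,
# each TPC with 2 SMs, each SM with 64 FP32 CUDA cores.
GPCS = 8
TPCS_PER_GPC = 8
SMS_PER_TPC = 2
CORES_PER_SM = 64

total_sms = GPCS * TPCS_PER_GPC * SMS_PER_TPC   # 128 SMs
total_cores = total_sms * CORES_PER_SM          # 8192 CUDA cores

print(total_sms, total_cores)  # 128 8192
```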
At the same time, each SM also contains four third-generation Tensor cores, for 512 across the whole chip. Externally, the chip is paired with six HBM2 stacks of 8GB each, driven by twelve 512-bit memory controllers for a total bus width of 6144-bit.
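The full-die Tensor core and memory figures follow directly from those per-unit numbers; a quick check, using the specs quoted above:

```python
# Full GA100 memory and Tensor-core configuration.
HBM2_STACKS = 6
GB_PER_STACK = 8
CONTROLLERS = 12
BITS_PER_CONTROLLER = 512
SMS = 128
TENSOR_CORES_PER_SM = 4

total_hbm2_gb = HBM2_STACKS * GB_PER_STACK            # 48 GB on a full die
total_bus_bits = CONTROLLERS * BITS_PER_CONTROLLER    # 6144-bit bus
total_tensor_cores = SMS * TENSOR_CORES_PER_SM        # 512 Tensor cores

print(total_hbm2_gb, total_bus_bits, total_tensor_cores)  # 48 6144 512
```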
In addition, the L2 cache has ballooned from 6MB to 40MB, the shared memory of each SM has grown from a maximum of 96KB to 164KB, and while the register file remains 256KB per SM, the registers across the entire chip add up to 27MB.
The GA100 chip is manufactured on TSMC's first-generation 7nm (N7) process, with a die area of 826 square millimeters, only 11 square millimeters (1.3%) larger than the previous generation's 12nm GV100. Yet the transistor count has soared from 21.1 billion to 54.2 billion, roughly 2.6 times as many, while power consumption is held to 400W (an increase of 33%), showing what the new architecture and new process can do.
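Those figures imply a dramatic jump in transistor density, which is worth spelling out. A rough calculation from the numbers above (GV100 area of roughly 815 mm² is NVIDIA's published figure, consistent with the stated 11 mm² delta):

```python
# Transistor counts and die area from the article's figures.
gv100_transistors = 21.1e9
ga100_transistors = 54.2e9
ga100_area_mm2 = 826

ratio = ga100_transistors / gv100_transistors        # ~2.57x the transistors
density_mtr_mm2 = ga100_transistors / 1e6 / ga100_area_mm2  # ~65.6 MTr/mm^2

print(round(ratio, 2), round(density_mtr_mm2, 1))
```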
A core this large will obviously run into yield problems in the early stages of mass production, so the A100 as shipped does not use the full specification. But whereas in the past entire blocks of compute units were simply disabled, this cut is slightly more complicated.
One entire GPC is disabled, and of the rest, not everything is enabled either: two of the remaining GPCs each lose one TPC (two SMs), leaving a total of 108 SMs, 6912 stream processors, and 432 Tensor cores.
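The cut-down arithmetic works out as follows, starting from the full 128-SM die:

```python
# One full GPC (8 TPCs x 2 SMs = 16 SMs) is disabled,
# plus one TPC (2 SMs) in each of two other GPCs.
FULL_SMS = 128
SMS_PER_GPC = 16
DISABLED_TPCS = 2
SMS_PER_TPC = 2
CORES_PER_SM = 64
TENSOR_CORES_PER_SM = 4

active_sms = FULL_SMS - SMS_PER_GPC - DISABLED_TPCS * SMS_PER_TPC
active_cores = active_sms * CORES_PER_SM
active_tensor = active_sms * TENSOR_CORES_PER_SM

print(active_sms, active_cores, active_tensor)  # 108 6912 432
```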
The boost clock of 1410MHz is actually lower than the previous two generations', but overall performance still takes a big leap.
The memory did not escape the knife either: only five of the six HBM2 stacks are enabled, for a total capacity of 40GB, a bus width of 5120-bit, a clock of 1215MHz, and 1555GB/s of bandwidth, an increase of 73% over the previous generation.
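The bandwidth figure checks out from the bus width and clock, assuming HBM2's double-data-rate signaling (two transfers per clock):

```python
# Memory bandwidth = bus width (bytes) x clock x 2 (DDR).
bus_bits = 5120
clock_mhz = 1215

bandwidth_gbs = bus_bits / 8 * clock_mhz * 2 / 1000  # GB/s
gain_vs_v100 = bandwidth_gbs / 900 - 1               # V100: 900 GB/s

print(round(bandwidth_gbs, 1))  # 1555.2
print(round(gain_vs_v100 * 100))  # ~73 (%)
```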
Looking at each SM specifically: although the number of Tensor cores per SM has been reduced from eight to four, each one supports up to 256 FP16 FMA operations per clock cycle, for 1024 per SM in total, double that of the Volta and Turing architectures.
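Those per-SM numbers also let us reconstruct the A100's peak dense FP16 Tensor throughput, counting each FMA as two floating-point operations (a back-of-the-envelope sketch using the 108-SM, 1410MHz shipping configuration):

```python
# Peak dense FP16 Tensor throughput of the cut-down A100.
sms = 108
fma_per_sm_per_clock = 1024   # 4 Tensor cores x 256 FP16 FMAs
boost_ghz = 1.41

# Each FMA = 1 multiply + 1 add = 2 FLOPs.
tflops = sms * fma_per_sm_per_clock * 2 * boost_ghz / 1000

print(round(tflops))  # 312
```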
The new Tensor cores also support acceleration for all major data types, including FP16, BF16, TF32, FP64, INT8, INT4, and binary.
We won't unpack the finer technical details here.
Comparison of the flagship core specifications across the three architecture generations