NVIDIA Ampere Architecture In-Depth Analysis: Significantly Raising the Threshold for Cloud AI Chips

At the recent GTC, NVIDIA unveiled its latest Ampere architecture and the A100 GPU built on it. The A100 is manufactured on TSMC's 7nm process and contains 54.2 billion transistors; according to NVIDIA, it delivers seven times the performance of the previous-generation V100. Beyond the jump in compute power, NVIDIA added the Multi-Instance GPU (MIG) feature, which allows a single GPU to be virtualized into as many as seven separate GPUs.

Announced alongside the Ampere architecture was the NVIDIA DGX A100 supercomputer, which packs eight A100 GPUs and delivers peak compute of up to 10 PetaOPS.

At the launch, NVIDIA played up the compute numbers heavily. In our view, however, the features NVIDIA has added beyond raw compute will be the more important barrier to entry, and China's semiconductor industry, in its push to develop home-grown GPUs, also needs to take these important features beyond compute into account.

Computing architecture: incremental improvements, progressing as expected

Compared with the previous-generation V100 GPU, the A100's performance gains come mainly from the following:

Added support for sparse operations. This is probably the biggest innovation in the A100's computing architecture. Specifically, the A100 supports 2:4 structured sparsity: when sparse computation is used, at least two of every four elements in the matrix must be zero. With sparse operations, performance doubles.

In fact, the idea of exploiting sparsity in deep-learning computation has been around for almost five years, and NVIDIA has now finally brought it to a product. The 2x speedup from 2:4 structured sparsity can even be called conservative (by comparison, Cambricon's 2018 AI accelerator IP supported up to 4x sparse acceleration).
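
To make the 2:4 pattern concrete, here is a minimal NumPy sketch of magnitude-based 2:4 pruning. It is an illustration only, not NVIDIA's tooling (NVIDIA ships its own pruning utilities), and the helper name prune_2_to_4 is ours.

```python
import numpy as np

def prune_2_to_4(w: np.ndarray) -> np.ndarray:
    """Zero 2 of every 4 consecutive elements, keeping the 2 with the largest magnitude."""
    groups = w.reshape(-1, 4)                          # assumes w.size is a multiple of 4
    drop = np.argsort(np.abs(groups), axis=1)[:, :2]   # indices of the 2 smallest magnitudes
    pruned = groups.copy()
    np.put_along_axis(pruned, drop, 0.0, axis=1)       # zero out the dropped positions
    return pruned.reshape(w.shape)

w = np.random.randn(2, 8).astype(np.float32)
print(prune_2_to_4(w))   # every consecutive group of 4 now contains exactly 2 zeros
```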

The TF32 numeric format was introduced. This is aimed mainly at training. Looking back at the evolution of AI training computation, the 32-bit floating-point format (FP32) was the first in common use. To speed up training, NVIDIA began supporting the 16-bit FP16 format a few years ago; FP16 is faster, but its dynamic range causes problems in some applications.

In the A100, NVIDIA introduced the TF32 format to solve FP16's problem. TF32 is in fact not a 32-bit format but a 19-bit one: its dynamic range (exponent) is 8 bits, the same as FP32, while its precision (mantissa) is 10 bits, the same as FP16, making it effectively a fusion of FP32 and FP16. Compared with FP32, TF32 delivers an 8x throughput improvement.
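
For intuition about what a 19-bit format means numerically, the sketch below emulates TF32's reduced precision by truncating an FP32 value's mantissa from 23 bits to 10. This is a simplified assumption (real Tensor Cores round internally rather than plainly truncate), and truncate_to_tf32 is a hypothetical helper, not an NVIDIA API.

```python
import numpy as np

def truncate_to_tf32(x) -> np.ndarray:
    """Keep FP32's sign and 8-bit exponent, but only the top 10 of its 23 mantissa bits."""
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    bits = bits & np.uint32(0xFFFFE000)   # clear the 13 least-significant mantissa bits
    return bits.view(np.float32)

x = np.array([1.2345678], dtype=np.float32)
print(x, truncate_to_tf32(x))   # [1.2345678] -> [1.234375]: same range, coarser precision
```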

More and stronger streaming multiprocessors (SMs). In the A100, each SM has twice the matrix (tensor) compute throughput of a V100 SM, and the GPU contains about 30% more SMs than the V100.

Larger on-chip storage and faster memory interfaces. In the A100, the L1 cache per SM grows from the V100's 128KB to 192KB, and the L2 cache grows to 40MB, a 6.7x increase over the previous generation. On the memory side, the A100's HBM2 provides a total bandwidth of 1,555GB/s, a 1.7x increase over the previous generation.

Overall, on the computing-architecture side, apart from sparse-computation support and the introduction of TF32, the other improvements are predictable, incremental additions, and neither sparse computation nor TF32 is a new concept in AI computing. We believe this generation's compute improvements in the A100 are evolutionary rather than revolutionary.

GPU Virtual Instances and Interconnect: Further Raising the Barriers to Competition

We believe that, beyond compute power, the A100's more important competitive barriers come from its support for GPU virtual instances and its interconnect solutions for the data center.

An important new feature of the Ampere architecture is MIG, the GPU virtual instance capability. As GPUs account for a growing share of cloud data-center deployments, GPU virtualization becomes an important problem; if left unsolved, it drags down overall GPU utilization.

Today, the CPU and memory instances that users request from cloud services are mostly virtualized: when you request N CPU cores, you are not handed an entire physical CPU; cores on the same physical chip may well be assigned to different users, and users do not need to care which chip their cores live on as long as the service works.

Roughly speaking, that is CPU virtualization. GPUs have also been virtualized before, in the sense that the same GPU can be used by different programs at the same time, but the memory-access model is not as mature as in CPU virtualization, so with multiple users the usual practice is not to share one GPU concurrently but to assign a whole GPU to each user.

This creates efficiency problems. For example, if user A needs only half of one GPU's compute resources while user B needs 1.5 GPUs, the traditional coarse-grained scheme ends up with A and B each occupying a whole GPU: user A wastes GPU resources, while user B's compute needs are not well served.

As GPUs are applied to more and more scenarios, different workloads place different demands on GPU resources and utilization, so sticking with the old coarse-grained scheme is bound to hurt overall data-center GPU utilization.
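
A quick back-of-the-envelope sketch of the user A / user B example above makes the utilization gap explicit. The figures are purely illustrative, and we take the variant in which user B is given two whole GPUs under the coarse scheme so that its demand is actually met.

```python
import math

# Hypothetical demands from the example above, as fractions of one GPU's compute.
demands = {"user_A": 0.5, "user_B": 1.5}
total_demand = sum(demands.values())                        # 2.0 GPUs' worth of work

# Coarse-grained allocation: every user gets whole GPUs, rounded up.
coarse_gpus = sum(math.ceil(d) for d in demands.values())   # 1 + 2 = 3 GPUs
print(f"coarse: {coarse_gpus} GPUs, {total_demand / coarse_gpus:.0%} average utilization")

# Fine-grained (MIG-style) allocation: fractional slices of a GPU can be assigned.
fine_gpus = math.ceil(total_demand)                         # 2 GPUs
print(f"fine:   {fine_gpus} GPUs, {total_demand / fine_gpus:.0%} average utilization")
```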

MIG was created to solve this problem. MIG on the A100 supports partitioning a single GPU into up to seven separate instances, with each instance's memory space isolated from the others'. This enables fine-grained allocation of GPU compute resources and thus raises resource utilization in computing environments where demands are highly heterogeneous.
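
As a concrete illustration of the idea, the sketch below models MIG-style partitioning as a simple budget of seven compute slices and 40GB of memory on one A100. The profile names echo the ones NVIDIA uses for the A100 (such as 1g.5gb), but the resource figures and placement logic here are simplified assumptions for illustration, not the official specification; in practice instances are created and managed through NVIDIA's own tooling.

```python
from dataclasses import dataclass

@dataclass
class MigProfile:
    compute_slices: int   # out of 7 compute slices on one A100 (simplified model)
    memory_gb: int        # out of 40 GB on a 40 GB A100 (simplified model)

# Assumption-based profile table for illustration only.
PROFILES = {
    "1g.5gb":  MigProfile(1, 5),
    "2g.10gb": MigProfile(2, 10),
    "3g.20gb": MigProfile(3, 20),
    "7g.40gb": MigProfile(7, 40),
}

def place(requests):
    """Greedily place requested instances onto a single GPU's slice/memory budget."""
    free_slices, free_mem, placed = 7, 40, []
    for name in requests:
        p = PROFILES[name]
        if p.compute_slices <= free_slices and p.memory_gb <= free_mem:
            free_slices -= p.compute_slices
            free_mem -= p.memory_gb
            placed.append(name)
    return placed, free_slices, free_mem

# Seven small instances fill the whole GPU, each with its own isolated memory.
print(place(["1g.5gb"] * 7))   # seven instances placed, 0 slices and 5 GB left
```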

Admittedly, the at-most-seven-way split that MIG currently supports is not especially fine-grained, but it can be regarded as an important milestone on the road to GPU virtualization. Beyond MIG, the A100 also improves multi-chip interconnect.

First, the A100 includes third-generation NVLink, used mainly for communication among GPUs within a host; its bandwidth doubles relative to the V100, to 600GB/s. For GPU-to-CPU communication, the A100 supports PCIe Gen4, twice the bandwidth of the previous-generation PCIe Gen3. In addition, the A100's interconnect is deeply integrated with Mellanox's solutions, supporting RDMA over both Ethernet and InfiniBand.

The entry threshold for cloud AI chips is greatly raised

We believe the launch of the NVIDIA A100 has once again widened the gap with its competitors in cloud AI chips. On compute, the A100 is 11 times faster than the T4 on the BERT benchmark, whereas the Goya chip that Habana, the most successful of the startups (since acquired by Intel at a high price), launched last year was only about twice as fast as the T4 on the same benchmark, so the A100 clearly takes the top spot. We believe NVIDIA's main advantage in advancing compute power lies in its strong systems-engineering capability.

As we analyzed above, the compute-unit architecture innovations NVIDIA adopted in the A100 are not really new; they have existed in AI hardware for years, and many startups have tried similar implementations before. But once a chip reaches this size, the design is no longer just a logic-design problem: yield, heat dissipation, and other seemingly low-level factors have to be considered during top-level architectural design. In other words, others may well think of the same architectural innovations, yet they have no way to realize them in a giant chip with the A100's performance, and that is a barrier NVIDIA has accumulated over many years.

In fact, we believe compute power is only a small part of the A100's hardware barrier; its more important barriers come from features such as interconnect and virtualization. Interconnect and virtualization are essential requirements in cloud data-center scenarios, and implementing them requires solid, step-by-step design and accumulation.

If, before NVIDIA introduced virtualization features, cloud AI accelerators still competed mainly on compute power and startups therefore still had a chance to overtake on a curve, then after the A100 we think startups targeting the same cloud AI accelerator market have lost that opportunity: they must implement virtualization, RDMA, and the other features needed for distributed computing on their own chips, step by step, before they are even qualified to face NVIDIA head-on.

Another possible strategy in the cloud computing market is for other chipmakers to target areas that NVIDIA has not yet had the bandwidth to address and that the GPU's SIMT architecture does not cover well, such as certain computations in FinTech. We expect more such startups to emerge over the next few years.

Implications for home-grown GPUs: compute power is not everything; support for distributed computing and virtualization matters too

The A100 also holds an important lesson for home-grown GPUs aimed at cloud data centers: compute power is not everything, and support for distributed computing and multi-user virtualization may matter even more.

In today's cloud high-performance computing, a large share of workloads run as distributed computations. In distributed computing, the compute power of a single GPU card is only the foundation; IO, beyond raw compute, can be the decisive factor for performance. IO here includes communication among multiple cards within a single host, communication between GPU and CPU, and communication across hosts.

In NVIDIA's technology stack, multi-card communication within a host is covered by NVLink, and multi-host communication by RDMA and SmartNIC technology from the newly acquired Mellanox. In the IO domain, too, NVIDIA leads the world, which keeps its cloud GPU solutions in a class of their own. Closely tied to distributed computing is virtualization support: as noted earlier, GPU virtualization brings significant gains in GPU resource utilization in the cloud.

Beyond higher utilization, however, the virtualized access model gives the distributed-computing software stack a clean interface: engineers building distributed systems can construct flexible multi-user models and interfaces without caring about the GPU's low-level implementation details, which provides strong system-level support and enablement for efficient distributed systems.

We believe GPU virtualization is still in its early stages, and we will see NVIDIA and other American and European vendors keep investing in this direction. For home-grown GPUs, we have long stressed the need to build a sound ecosystem if they are to become genuinely competitive. That ecosystem includes, first, a well-designed architecture, which points to support for IO and other data-communication interconnects, and second, a friendlier, easier-to-use development environment that lets developers build a variety of multi-user cloud applications on top of the hardware, with virtualization as the core component of multi-user support.

We believe that a GPU with powerful compute but limited support for distributed computing and virtualization will do less for the home-grown ecosystem than a GPU whose compute is weaker (say, only half or even a third of NVIDIA's) but whose support for distributed and multi-user scenarios is reasonably complete. And those two capabilities are precisely the kind that demand solid, step-by-step accumulation; there is no overtaking on a curve here.