In recent years, the supercomputing community has been working toward the "exascale" (10^18 FLOPS) vision, which is expected to set the tone for the next decade. The Aurora supercomputer, built by Intel in partnership with Argonne National Laboratory, is moving toward this goal. The contract between the two parties was reportedly signed some time ago, but with shifts in the market and the hardware manufacturer's setbacks, the project has not gone smoothly.
(Image via AnandTech)
Aurora's supercomputing hardware, jointly developed by Argonne, Cray, and Intel, was originally expected to be delivered by 2020. Built around Intel's Xeon Phi platform, it was to deliver high throughput and acceleration through Intel's AVX-512 instructions and the 10nm Knights Hill architecture.
Unfortunately, those plans were made before the artificial intelligence (AI) revolution accelerated. Intel subsequently added AVX-512 support to its mainstream Xeon server processors, and the converged Xeon Phi platform was ultimately wound down (after the short-lived Knights Mill).
With this in mind, Intel had to rethink how Aurora would be built, this time around its own Xeon CPUs and Xe GPUs. As part of today's announcement, Intel disclosed some of the underlying details of the Aurora supercomputer.
While details such as core counts and memory types have not been disclosed, it is at least clear that a standard node will contain two next-generation CPUs and six next-generation GPUs, connected through new interconnect standards.
The planned Sapphire Rapids CPU is Intel's second-generation 10nm server processor, following Ice Lake Xeon. Today's announcement reaffirms that the processor is expected in the second half of 2021, while Ice Lake will enter volume production by the end of 2020.
In terms of specifications, each Sapphire Rapids processor supports 8 memory channels and has enough I/O to connect to three GPUs. Within a single Aurora compute node, two Sapphire Rapids processors work in tandem, with support for next-generation Optane DCPMM persistent memory.
Another source has said Sapphire Rapids may support DDR5, though Intel has not confirmed this. On the GPU side, each Aurora node will host six Intel 7nm Ponte Vecchio Xe GPUs working in concert.
Built on the Xe architecture, Ponte Vecchio draws on several of Intel's key packaging technologies, such as Foveros chip stacking, Embedded Multi-die Interconnect Bridge (EMIB), and high-bandwidth memory (HBM).
In terms of functionality, Intel claims only that Ponte Vecchio will have vector matrix units and high double-precision performance, which is likely essential for Argonne's research.
Another core technology in the Aurora node is the adoption of the new CXL (Compute Express Link) interconnect standard, which allows the CPUs and GPUs to be directly connected and to work within a unified memory space.
Each Aurora node will have eight fabric endpoints, providing a number of topology options. With Cray on board, the interconnect will be a version of Cray's Slingshot network architecture.
The same architecture will also be used in other U.S. supercomputing projects in the early 2020s. Intel says Slingshot will connect approximately 200 racks for Aurora, with 10 petabytes of memory and 230 petabytes of storage.
In summary, a rough estimate of the Aurora system can be made from the following assumptions:
The system comprises roughly 200 racks in total;
Each rack is likely a standard 42U configuration;
Each Aurora node is a standard 2U configuration;
6U per rack is reserved for networking;
Roughly one-third of the racks are used for storage and other systems;
That works out to roughly 2,400 Aurora compute nodes (2,394, to be exact).
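The back-of-the-envelope estimate above can be sketched as follows; note that all of the inputs are the article's assumptions, not confirmed Intel figures:

```python
# Rough Aurora node-count estimate, using the article's assumptions.
TOTAL_RACKS = 200    # Slingshot reportedly connects ~200 racks
RACK_U = 42          # assumed: standard 42U racks
NODE_U = 2           # assumed: each Aurora node occupies 2U
NETWORK_U = 6        # assumed: 6U per rack reserved for networking

nodes_per_rack = (RACK_U - NETWORK_U) // NODE_U  # 18 nodes per rack
compute_racks = TOTAL_RACKS * 2 // 3             # 1/3 of racks go to storage etc.
total_nodes = compute_racks * nodes_per_rack

print(nodes_per_rack, compute_racks, total_nodes)  # 18 133 2394
```

Rounding 2,394 up gives the ~2,400-node figure used below.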
If so, the entire Aurora supercomputer would contain nearly 5,000 Intel Sapphire Rapids CPUs and roughly 15,000 Ponte Vecchio GPUs.
Assuming one ExaFLOP is spread evenly across those ~15,000 GPUs, each GPU would need to deliver about 66.6 TeraFLOPS on average. By comparison, today's top GPUs offer only around 14 TeraFLOPS of FP32 performance.
If Intel can improve single-GPU HPC performance by roughly 5x, that would be quite an impressive leap (leaving power constraints aside).
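The per-GPU math can be checked in a few lines; the GPU count and the ~14 TFLOPS FP32 baseline are the article's estimates, not measured figures:

```python
# Per-GPU performance needed to reach one ExaFLOP, per the article's estimates.
EXAFLOP = 1e18        # target system performance, in FLOPS
NUM_GPUS = 15_000     # assumed: estimated Ponte Vecchio count
CURRENT_FP32 = 14e12  # assumed: ~14 TFLOPS FP32 on today's top GPUs

per_gpu = EXAFLOP / NUM_GPUS       # ~66.7 TFLOPS required per GPU
speedup = per_gpu / CURRENT_FP32   # ~4.8x over today's GPUs

print(round(per_gpu / 1e12, 1), round(speedup, 1))  # 66.7 4.8
```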