At the end of 2016, eight of the ten core developers on Google's TPU team quietly left to found a machine learning systems company called Groq: reportedly the 100th company to enter the AI accelerator card market, the second to bring a product to commercialization, and the first to reach 1,000 trillion operations per second. For comparison, that is roughly four times the performance of NVIDIA's most powerful current graphics card.
The Groq Tensor Streaming Processor (TSP) is a single-core design that draws 300 W, and Groq has pulled it off: what looks like a disadvantage has been turned into one of the TSP's defining strengths.
The TSP is a huge piece of silicon consisting of almost nothing but vector and matrix processing units and on-chip memory; there is no controller or back-end logic, and the compiler controls the hardware directly. The chip is divided into 20 superlanes. Each superlane is built, from left to right, of: a matrix unit (320 MACs), a switch unit, a memory unit (5.5 MB), a vector unit (16 ALUs), a memory unit (5.5 MB), a switch unit, and a matrix unit (320 MACs).
A single instruction stream is fed to each component of superlane 0, where the matrix unit supports 6 instructions, the switch unit 14, the memory unit 44, and the vector unit 16. Each clock cycle, a unit performs its operation and moves data to the next unit within the superlane. Each unit can send 512 bytes to, and receive 512 bytes from, its neighbors.
When a superlane finishes, it passes everything to the next superlane below and receives the contents of the superlane (or instruction controller) above it. Instructions always travel vertically downward between superlanes, while data moves only horizontally within a superlane.
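The two-axis flow described above (instructions stepping downward, data moving sideways) can be sketched with a toy simulation. This is a hypothetical illustration of the described dataflow, not Groq's software; the unit names and the one-hop data shift are simplifications standing in for the 512-byte neighbor transfers.

```python
# Toy model of the superlane dataflow described above: instruction
# bundles step downward from superlane to superlane each cycle, while
# data moves horizontally between neighboring units within a superlane.
# Purely illustrative; names and granularity are invented for the sketch.

NUM_SUPERLANES = 20
# Functional units in one superlane, left to right, per the article.
UNITS = ["matrix", "switch", "memory", "vector", "memory", "switch", "matrix"]

def step(instructions, lanes):
    """Advance one clock cycle.

    `instructions` holds one instruction bundle slot per superlane;
    `lanes` holds one data slot per unit per superlane.
    """
    # Instructions shift vertically: superlane i hands its bundle to i+1.
    instructions = [None] + instructions[:-1]
    # Data shifts horizontally within each superlane (one hop rightward,
    # standing in for the 512-byte neighbor-to-neighbor transfers).
    lanes = [[None] + lane[:-1] for lane in lanes]
    return instructions, lanes

# Issue one instruction bundle into superlane 0 and watch it propagate.
instrs = ["bundle0"] + [None] * (NUM_SUPERLANES - 1)
data = [[None] * len(UNITS) for _ in range(NUM_SUPERLANES)]
for _ in range(3):
    instrs, data = step(instrs, data)
print(instrs.index("bundle0"))  # after 3 cycles the bundle is 3 lanes down: 3
```

The key property the sketch captures is that both movements are lockstep and unconditional, which is what lets the compiler know where every bundle and every byte is on every cycle.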
On ResNet-50, the TSP performs 20,400 inferences per second (I/S) regardless of batch size, with an inference latency of 0.05 milliseconds. NVIDIA's Tesla V100 executes 7,907 I/S at a batch size of 128, or 1,156 I/S at a batch size of 1.
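A quick calculation puts the quoted figures in perspective, particularly the gap at batch size 1, where latency-sensitive workloads actually run:

```python
# Ratios implied by the ResNet-50 numbers quoted above.
groq_ips = 20_400          # TSP inferences/s, any batch size
v100_ips_b128 = 7_907      # Tesla V100 at batch size 128
v100_ips_b1 = 1_156        # Tesla V100 at batch size 1

print(round(groq_ips / v100_ips_b128, 1))  # ~2.6x even at the V100's best batch size
print(round(groq_ips / v100_ips_b1, 1))    # ~17.6x at batch size 1

# The 0.05 ms latency figure means each inference completes in 50 microseconds.
latency_s = 0.05e-3
print(latency_s * 1e6)  # 50.0
```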
Because Groq controls both the hardware and the software, the compiler knows exactly how the chip works and how long each calculation takes. The compiler moves data and instructions to the right place at the right time so that nothing stalls. The flow of instructions to the hardware is fully orchestrated, making processing both faster and predictable.
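The scheduling idea can be sketched in a few lines: if the compiler knows the exact cycle cost of every operation, it can assign each one a fixed start cycle at compile time, so the hardware never waits on dynamic arbitration. The operation names and latencies below are invented for the example; this is an illustration of static scheduling in general, not Groq's compiler.

```python
# Hypothetical static schedule for a dependent chain of operations.
# Latencies (in cycles) are invented for illustration.
ops = [("load", 4), ("matmul", 10), ("activate", 2), ("store", 4)]

def schedule(ops):
    """Return (name, start_cycle, end_cycle) for each op in a dependent chain."""
    plan, cycle = [], 0
    for name, latency in ops:
        plan.append((name, cycle, cycle + latency))
        cycle += latency
    return plan

for name, start, end in schedule(ops):
    print(f"{name:>8}: cycles {start}-{end}")
# Because every latency is known, the total runtime (20 cycles here) is
# determined exactly at compile time; no run-time stall logic is needed.
```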
Developers can run the same model 100 times on the Groq chip and get exactly the same result every time. For applications where safety and accuracy are paramount, such as self-driving cars, this computational reproducibility is critical. In addition, systems built on Groq hardware do not suffer from long-tail latency, and AI systems can be tuned to a specific power or latency budget.
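The reproducibility claim is worth making concrete, since floating-point results on parallel hardware normally vary with execution order. A pure-Python stand-in shows the property being claimed: with a fixed evaluation order, repeated runs are bit-exact.

```python
# Sketch of bit-exact reproducibility: when the evaluation order is
# fixed (as it is on deterministic hardware), the same model and input
# produce an identical floating-point result on every run. This is a
# stand-in demonstration, not Groq code.
def run_model(inputs):
    acc = 0.0
    for x in inputs:          # fixed summation order, no threads
        acc += x * 0.5
    return acc

inputs = [0.1 * i for i in range(1000)]
results = {run_model(inputs) for _ in range(100)}
print(len(results))  # one unique result across 100 runs: 1
```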
This software-first design, in which the compiler determines the hardware architecture, helped Groq build a simple, high-performance architecture that accelerates inference. The architecture supports both traditional and emerging machine learning models and is currently running at customer sites on x86 and non-x86 systems.
According to an official press release, the TSP is already available to select customers as an accelerator on the Nimbix Cloud.