The latest trend in artificial intelligence is that larger natural language models deliver better accuracy, but larger models are difficult to train because of cost, time, and code integration barriers. Microsoft recently open-sourced DeepSpeed, a deep learning optimization library that makes it possible to train models with more than 100 billion parameters on current-generation GPU clusters by improving scale, speed, and usability while reducing cost.
At the same time, according to Microsoft, its system performance is more than 5 times that of state-of-the-art systems.
According to Microsoft, the DeepSpeed library includes a component called ZeRO (Zero Redundancy Optimizer), a new parallelized optimizer that greatly reduces the resources required for model and data parallelism while substantially increasing the number of parameters that can be trained. The researchers used these breakthroughs to create Turing-NLG, the largest publicly known language model, with 17 billion parameters.
As part of DeepSpeed, ZeRO is a new memory optimization technology for large-scale distributed deep learning. It can train deep learning models with 100 billion parameters on current GPU clusters at three to five times the throughput of the current best systems, and it provides a clear path to training models with trillions of parameters.
ZeRO has three main optimization stages, which correspond to partitioning the optimizer states, the gradients, and the parameters.
ZeRO overcomes the limitations of data parallelism and model parallelism while retaining the advantages of both. Rather than replicating the model states on every data-parallel process, it eliminates the memory redundancy by partitioning them (parameters, gradients, and optimizer states) across those processes, as shown in the figure above. During training it uses a dynamic communication schedule to share the necessary states among the distributed devices, which preserves the computational granularity and communication volume of data parallelism.
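To make the effect of partitioning concrete, here is a rough back-of-the-envelope sketch of per-GPU memory for the model states under each stage. It is not taken from this article: the byte counts (2 bytes per parameter for fp16 weights, 2 for fp16 gradients, 12 for Adam optimizer states in mixed precision) are assumptions borrowed from the ZeRO paper's analysis, the function name is hypothetical, and activations, buffers, and fragmentation are ignored.

```python
# Assumed bytes per parameter for mixed-precision Adam training
# (figures follow the ZeRO paper's analysis, not this article).
PARAM_BYTES = 2    # fp16 weights
GRAD_BYTES = 2     # fp16 gradients
OPTIM_BYTES = 12   # fp32 weights copy + Adam momentum + Adam variance


def model_state_bytes_per_gpu(num_params, num_gpus, stage):
    """Estimate model-state memory per GPU (hypothetical helper).

    stage 0: plain data parallelism, everything replicated
    stage 1: optimizer states partitioned
    stage 2: optimizer states and gradients partitioned
    stage 3: optimizer states, gradients, and parameters partitioned
    """
    if stage not in (0, 1, 2, 3):
        raise ValueError("stage must be 0, 1, 2, or 3")
    params = PARAM_BYTES * num_params
    grads = GRAD_BYTES * num_params
    optim = OPTIM_BYTES * num_params
    if stage >= 1:
        optim /= num_gpus
    if stage >= 2:
        grads /= num_gpus
    if stage >= 3:
        params /= num_gpus
    return params + grads + optim


if __name__ == "__main__":
    # Example: a 7.5-billion-parameter model on 64 GPUs.
    for stage in range(4):
        gb = model_state_bytes_per_gpu(7_500_000_000, 64, stage) / 1e9
        print(f"stage {stage}: {gb:.1f} GB per GPU")
```

Under these assumptions, a 7.5-billion-parameter model on 64 GPUs needs about 120 GB per GPU for model states with plain data parallelism and about 31.4 GB once only the optimizer states are partitioned, which is why even the first stage already allows much larger models.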
The first stage of ZeRO, optimizer state partitioning (ZeRO-OS for short), has been implemented and released with DeepSpeed; it has a demonstrated capability to support models with 100 billion parameters.
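The idea behind optimizer state partitioning can be illustrated with a toy, single-process simulation. This is a minimal sketch and not DeepSpeed's implementation: it uses SGD with momentum instead of Adam, plain Python lists instead of tensors, assumes the gradients have already been averaged across ranks, and all function names are hypothetical. Each simulated rank keeps momentum only for its own slice of the parameters, updates that slice, and the slices are then concatenated, standing in for an all-gather.

```python
def partition_bounds(n, world_size, rank):
    """Return the [start, end) slice of a length-n vector owned by `rank`."""
    chunk = (n + world_size - 1) // world_size  # ceiling division
    start = min(rank * chunk, n)
    return start, min(start + chunk, n)


def momentum_step(params, grads, momentum, lr=0.1, beta=0.9):
    """SGD-with-momentum update; returns (new_params, new_momentum)."""
    new_m = [beta * m + g for m, g in zip(momentum, grads)]
    new_p = [p - lr * m for p, m in zip(params, new_m)]
    return new_p, new_m


def replicated_step(params, grads, momentum):
    """Baseline: every rank would hold the full optimizer state."""
    return momentum_step(params, grads, momentum)


def partitioned_step(params, grads, shards, world_size):
    """ZeRO-OS-style step: rank r stores momentum only for its own slice."""
    gathered, new_shards = [], []
    for rank in range(world_size):
        start, end = partition_bounds(len(params), world_size, rank)
        p, m = momentum_step(params[start:end], grads[start:end], shards[rank])
        new_shards.append(m)
        gathered.extend(p)  # stands in for the all-gather of updated slices
    return gathered, new_shards


params = [1.0, 2.0, 3.0, 4.0, 5.0]
grads = [0.5, -0.5, 1.0, 0.0, 2.0]   # assumed already averaged across ranks
world_size = 2

full_params, full_momentum = replicated_step(params, grads, [0.0] * len(params))

shards = []
for rank in range(world_size):
    start, end = partition_bounds(len(params), world_size, rank)
    shards.append([0.0] * (end - start))
part_params, part_shards = partitioned_step(params, grads, shards, world_size)

print(part_params == full_params)       # same update as the replicated baseline
print([len(s) for s in part_shards])    # optimizer state held by each rank
```

The partitioned step produces the same parameters as the replicated one, while each rank stores only about 1/N of the optimizer state, where N is the number of data-parallel processes.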
DeepSpeed is compatible with PyTorch. Its API is a lightweight wrapper around PyTorch, which means developers can use everything in PyTorch without having to learn a new platform. In addition, DeepSpeed manages the boilerplate of state-of-the-art training techniques, such as distributed training, mixed precision, gradient accumulation, and checkpointing, so developers can focus on model development. At the same time, developers can take advantage of DeepSpeed’s distinctive efficiency and effectiveness benefits to increase speed and scale by changing only a few lines of code in a PyTorch model.
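Much of this behavior is driven by a JSON configuration file rather than by code. The sketch below builds such a configuration in Python. The key names follow DeepSpeed's configuration documentation as best understood here, but the schema has changed between releases (the `zero_optimization` entry in particular), so treat them as assumptions to check against the installed version. The helper `make_deepspeed_config` and all numeric values are hypothetical.

```python
import json


def make_deepspeed_config(micro_batch_per_gpu, grad_accum_steps, num_gpus, lr=1e-4):
    """Build a DeepSpeed-style config dict (hypothetical helper).

    The global batch size is kept consistent with the micro-batch size,
    the gradient accumulation steps, and the number of GPUs.
    """
    return {
        "train_batch_size": micro_batch_per_gpu * grad_accum_steps * num_gpus,
        "train_micro_batch_size_per_gpu": micro_batch_per_gpu,
        "gradient_accumulation_steps": grad_accum_steps,
        "optimizer": {"type": "Adam", "params": {"lr": lr}},
        "fp16": {"enabled": True},              # mixed-precision training
        "zero_optimization": {"stage": 1},      # optimizer state partitioning
    }


config = make_deepspeed_config(micro_batch_per_gpu=4, grad_accum_steps=8, num_gpus=16)
config_json = json.dumps(config, indent=2)
print(config_json)
```

In DeepSpeed's documented workflow, a configuration like this is handed to `deepspeed.initialize`, which returns an engine whose `backward` and `step` methods take the place of the usual PyTorch calls. That part is left out here because it needs a GPU environment with DeepSpeed installed.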
DeepSpeed excels in four areas:
Scale: Large state-of-the-art models such as OpenAI GPT-2, NVIDIA Megatron-LM, and Google T5 have 1.5 billion, 8.3 billion, and 11 billion parameters, respectively. ZeRO stage 1 in DeepSpeed provides system support to run models with up to 100 billion parameters, about 10 times larger than the current most advanced models. Planned support for ZeRO stages 2 and 3 is expected to make it possible to train models with 200 billion or even trillions of parameters.
Speed: Across a variety of hardware, the throughput observed so far is as much as 5 times higher than the current state of the art. For example, to train large models on GPT-family workloads, DeepSpeed combines ZeRO-powered data parallelism with NVIDIA Megatron-LM model parallelism. On NVIDIA GPU clusters with low-bandwidth interconnects (no NVIDIA NVLink or InfiniBand), DeepSpeed increased throughput by 3.75 times compared with using Megatron-LM alone for a standard GPT-2 model with 1.5 billion parameters. On NVIDIA DGX-2 clusters with high-bandwidth interconnects, models with 20 to 80 billion parameters train three to five times faster. These throughput gains come from DeepSpeed’s higher memory efficiency and its ability to fit these models using a lower degree of model parallelism and larger batch sizes.
Cost: Higher throughput translates into significantly lower training costs. For example, to train a model with 20 billion parameters, DeepSpeed requires about three times fewer resources.
Ease of use: Only a few lines of code changes are needed to enable a PyTorch model to use DeepSpeed and ZeRO. Compared with existing model parallelism libraries, DeepSpeed does not require redesigning the code or refactoring the model, and it places no limits on model dimensions, batch size, or any other training parameters. For models with up to 6 billion parameters, you can conveniently use the data parallelism powered by ZeRO without any model parallelism, whereas standard data parallelism runs out of memory for models with more than 1.3 billion parameters. ZeRO stages 2 and 3 will further increase the model size that can be trained with data parallelism alone. In addition, DeepSpeed supports a flexible combination of ZeRO-powered data parallelism with model parallelism.
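To picture what combining data parallelism with model parallelism means in terms of GPUs, the following sketch splits a set of ranks into model-parallel and data-parallel groups. It assumes a Megatron-style layout in which consecutive ranks share one model replica and strided ranks hold the same model shard. The function name is hypothetical, and the sketch says nothing about how DeepSpeed itself forms its groups.

```python
def parallel_groups(world_size, model_parallel_size):
    """Split ranks into model-parallel and data-parallel groups (hypothetical helper).

    Assumed layout: consecutive ranks form one model-parallel group
    (together they hold one model replica); ranks with the same position
    inside their group form one data-parallel group.
    """
    if world_size % model_parallel_size != 0:
        raise ValueError("world_size must be divisible by model_parallel_size")
    mp = model_parallel_size
    data_parallel_size = world_size // mp
    model_groups = [list(range(i * mp, (i + 1) * mp)) for i in range(data_parallel_size)]
    data_groups = [list(range(j, world_size, mp)) for j in range(mp)]
    return model_groups, data_groups


# 8 GPUs with a model-parallel degree of 2 leave a data-parallel degree of 4.
model_groups, data_groups = parallel_groups(world_size=8, model_parallel_size=2)
print(model_groups)
print(data_groups)
```

For a fixed number of GPUs, lowering the model-parallel degree raises the data-parallel degree, which is the regime in which partitioning model states across data-parallel processes saves the most memory.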
For more details, see Microsoft’s blog post:
ZeRO & DeepSpeed: New system optimizations enable training models with over 100 billion parameters