[Papers] ZeRO-Offload: Democratizing Billion-Scale Model Training

2021. 4. 3. 19:19 · Papers

[Link to Paper] ZeRO-Offload

PAPER SUMMARY

PROBLEM

  • Training large models requires enough GPU devices that their combined GPU memory can hold the model states (even with pipeline parallelism, model parallelism, etc.)
  • Using that many GPUs is prohibitively expensive, which puts large-model training out of reach for most people

SOLUTION

  • Democratize large model training with ZeRO-Offload, which exploits both CPU memory and CPU compute for offloading
  • Offload gradients, optimizer states, and the optimizer computation to the CPU; keep the parameters and the forward & backward computation on the GPU

 

 


BACKGROUND

⭐️ How memory is consumed in large model training, some methods of handling it efficiently, and how ZeRO works, as background.

Memory Consumption in Large model training

Memory in DNN model training can be classified into two parts:

  1. Model States: parameters, gradients, and optimizer states
  2. Residual States: activations, buffers, fragmented memory, etc.

Among these two, Model States are the main bottleneck for large model training because:

  • for each parameter, mixed-precision training keeps two copies: one in fp16 (2 bytes) and one in fp32 (4 bytes)
  • the corresponding gradient for each parameter is kept in fp16 (2 bytes)
  • for each parameter, the Adam optimizer keeps the momentum in fp32 (4 bytes) and the variance in fp32 (4 bytes)

 

$\therefore$ for a model with $M$ parameters, it requires $16M$ bytes of memory
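A quick sanity check of this accounting (the per-parameter byte counts come from the list above; the 1.5B-parameter example corresponds to GPT-2, which the paper notes needs roughly 24 GB just for its model states):

```python
def model_state_bytes(num_params: int) -> int:
    """Mixed-precision Adam: fp16 params + fp16 grads + fp32 params
    + fp32 momentum + fp32 variance = 2 + 2 + 4 + 4 + 4 = 16 bytes/param."""
    return num_params * (2 + 2 + 4 + 4 + 4)

# GPT-2 (1.5B parameters): ~24 GB of model states alone,
# before counting activations, buffers, or fragmentation.
print(model_state_bytes(1_500_000_000) / 1e9)  # ≈ 24.0 GB
```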

$\exists$ two categories of approaches for fitting these model and residual states onto GPUs:

  • Scale-out training
  • Scale-up training

 

Scale-out large model training

Uses the aggregate memory of multiple GPUs to satisfy the memory requirement:

  • Model Parallelism: partitions model vertically (intra-layer partitioning)
  • Pipeline Parallelism: partitions model horizontally (inter-layer partitioning)
  • ZeRO: splits the training batch across GPUs (like data parallelism), but partitions the model states across GPUs instead of replicating them, and communicates to gather individual parameters when they are needed

 

However, all three require enough GPUs that their aggregate memory covers the memory requirement.

ZeRO-Offload offloads model states to CPU memory instead

 

 

Scale-up large model training

How existing work scales up the model size on a single GPU:

  • Recomputing activations from checkpoints $\therefore$ no need to save all activation states
  • Using low or mixed precision
  • Using external memory such as CPU memory $\Leftarrow$ approach of ZeRO-Offload

 

ZeRO powered data parallel training

Three stages of ZeRO:

  1. ZeRO-1: partitions the optimizer states
  2. ZeRO-2: partitions the optimizer states and gradients
  3. ZeRO-3: partitions all model states

ZeRO-Offload works with ZeRO-2:

  • Each GPU stores a replica of all parameters, but updates only a mutually exclusive portion of them at the end of each training step $\Rightarrow$ it only needs to store the optimizer states and gradients for that portion
  • Forward pass: each GPU computes the loss with respect to a different mini-batch
  • Backward pass: each gradient is averaged using a reduce at the GPU that owns the corresponding partition
  • After the backward pass, each GPU updates its portion of the parameters and optimizer states using the averaged gradients
  • Perform an all-gather so that every replica ends up with the fully updated parameters (see the sketch below)
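A minimal single-process sketch of this data-parallel flow, simulating the per-GPU gradient averaging, partitioned update, and all-gather with NumPy (the world size, model size, and the plain SGD update standing in for Adam are all illustrative assumptions; a real setup would use collective communication between GPUs):

```python
import numpy as np

WORLD, M, LR = 4, 16, 0.1                        # assumed: 4 "GPUs", 16 parameters
params = np.random.randn(M).astype(np.float32)   # every rank keeps a full parameter replica
shards = np.array_split(np.arange(M), WORLD)     # rank i owns optimizer state for shards[i]

def train_step(per_rank_grads):
    # the backward pass already produced one gradient per rank (one mini-batch each)
    avg_grad = per_rank_grads.mean(axis=0)       # "reduce": average the gradients
    pieces = []
    for rank in range(WORLD):
        idx = shards[rank]
        # each rank updates only its own partition (SGD stands in for the Adam update)
        pieces.append(params[idx] - LR * avg_grad[idx])
    # "all-gather": every rank receives the updated pieces and rebuilds the full replica
    params[np.concatenate(shards)] = np.concatenate(pieces)

train_step(np.random.randn(WORLD, M).astype(np.float32))
```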

 

 


Offload Strategy

⭐️ How ZeRO-Offload handles the training graph for offloading

  • ZeRO-Offload models training as a data-flow graph and partitions this graph for efficient offloading (the analysis is specific to mixed-precision training with the Adam optimizer)

DL Training as a data flow graph

The DL training workload can be represented as a data-flow graph (see the figure in the paper).

  • The offload strategy is a two-way partitioning of this graph:
    • each computation is executed, and each data node is stored, on the device (CPU or GPU) that owns its partition

 

  • Graph partitioning is done based on:
    1. CPU computation overhead
    2. Communication overhead
    3. Memory savings

 

1. Limiting CPU computation

  • CPU computation throughput is much slower than GPU computation throughput $\Rightarrow$ avoid offloading compute-intensive components to the CPU
  • The overall complexity of DL training per iteration is $O(MB)$, where $M$ is the model size and $B$ is the effective batch size $\Rightarrow$ this part should stay on the GPU
  • Norm calculations and weight updates have complexity $O(M)$ $\Rightarrow$ these are offloaded to the CPU (rough scaling sketch below)
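A rough back-of-envelope sketch of why this split makes sense (the constants and the batch size are assumptions; only the $O(MB)$ vs. $O(M)$ scaling matters):

```python
M = 1_000_000_000       # model parameters (assumed)
B = 512                 # effective batch size (assumed)

fwd_bwd_ops   = 6 * M * B   # forward + backward grow with both M and B: O(MB)
optimizer_ops = 12 * M      # Adam touches each parameter a constant number of times: O(M)

# The compute-intensive O(MB) part stays on the GPU; the O(M) part is
# orders of magnitude cheaper and can run on the CPU.
print(fwd_bwd_ops / optimizer_ops)   # the ratio grows linearly with the batch size
```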

 

2. Minimizing Communication Volume

  • PCI-E bandwidth between CPU and GPU $\ll$ CPU memory bandwidth $\ll$ GPU memory bandwidth
  • $\therefore$ minimize communication between CPU and GPU!!
  • Create an fp32 super-node: keep all fp32 model states in one partition (a node named Update Super)
  • The fp16 parameters (parameter16) are co-located with the FWD-BWD Super node on the GPU

 

3. Maximizing Memory savings

  • the maximum memory saving is achieved by offloading both gradient16 and Update Super to the CPU

 

⭐️ GPU: FWD-BWD Super node, parameter16
⭐️ CPU: Update Super node (the fp32 model states), gradient16

 

 


ZeRO-Offload Schedule

⭐️ Computation & communication schedule for implementing ZeRO-Offload on a single-GPU system, with further explanation of how it scales to multiple GPUs

 

Single GPU Schedule

  • Forward pass: compute the loss. The fp16 parameters are already on the GPU, so no CPU communication is needed
  • Backward pass: the gradients of the parameters are computed at different points in the backward graph
  • ZeRO-Offload transfers these gradients to CPU memory as they are produced (the transfer can overlap with backpropagation on the remaining backward graph) $\Rightarrow$ possible to hide the communication cost
  • Update the fp32 parameters and the remaining optimizer states on the CPU
  • Copy the updated fp32 parameters from the CPU back to the fp16 parameters on the GPU (see the sketch below)
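A minimal PyTorch-flavored sketch of one training step under this schedule (the function and argument names are placeholders; the real implementation streams each gradient to pinned CPU memory as soon as it is produced and overlaps the copy with the rest of the backward pass via hooks and CUDA streams, which is simplified away here):

```python
import torch

def zero_offload_step(model_fp16, params_fp32, cpu_optimizer, batch):
    # forward pass on the GPU: fp16 parameters are already resident, no CPU traffic
    loss = model_fp16(batch).float().mean()

    # backward pass on the GPU; here the fp16 gradients are copied to the CPU
    # only after backward() finishes (ZeRO-Offload overlaps this with backward)
    loss.backward()
    cpu_grads = [p.grad.detach().to("cpu", dtype=torch.float32)
                 for p in model_fp16.parameters()]

    # optimizer step on the CPU: update the fp32 parameters and optimizer states
    for p32, g in zip(params_fp32, cpu_grads):
        p32.grad = g
    cpu_optimizer.step()
    cpu_optimizer.zero_grad()

    # copy the updated fp32 parameters back into the fp16 replicas on the GPU
    with torch.no_grad():
        for p16, p32 in zip(model_fp16.parameters(), params_fp32):
            p16.copy_(p32)      # copy_ handles the fp32 -> fp16 cast and the host-to-device transfer
            p16.grad = None
    return loss.item()
```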

 

Scaling to Multi-GPUs

Gradients are computed with ZeRO-2's data-parallel scheme, while each GPU offloads its partition of the gradients, the optimizer states, and the corresponding parameter updates to the CPU.

 

 

 


Optimized CPU Execution

⭐️ How the CPU execution of the parameter updates is sped up

 

1. Implementing the CPU Optimizer

A faster implementation of the Adam optimizer on the CPU, using SIMD vector instructions, loop unrolling, and OMP multithreading
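The paper's optimizer is written as a fused C++ kernel with AVX vector instructions, loop unrolling, and OpenMP threads; as a stand-in in the same spirit (one vectorized pass over flat fp32 arrays instead of a per-element Python loop), a NumPy sketch of the update it performs:

```python
import numpy as np

def cpu_adam_step(p32, g32, m, v, step, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam step over flat float32 arrays, updated in place."""
    m *= b1; m += (1 - b1) * g32            # first moment (momentum)
    v *= b2; v += (1 - b2) * g32 * g32      # second moment (variance)
    m_hat = m / (1 - b1 ** step)            # bias correction
    v_hat = v / (1 - b2 ** step)
    p32 -= lr * m_hat / (np.sqrt(v_hat) + eps)
    return p32.astype(np.float16)           # fp16 copy of the parameters to send back to the GPU
```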

 

2. One-step Delayed Parameter Update

CPU computation may become the bottleneck when the GPU compute time per step $\leq$ the CPU compute time per step

$\Rightarrow$ one-step delayed parameter update (DPU): overlap CPU and GPU compute

 

Delayed parameter update during the training process:

  • First $N-1$ steps: train without DPU to avoid destabilizing training in the early stages, where gradients change rapidly
  • Step $N$: obtain the gradients from the GPU, but skip the CPU optimizer step and do not update the fp16 parameters on the GPU
  • Step $N+1$: compute the parameter update on the CPU using the gradients from step $N$, while in parallel the GPU computes the forward and backward pass using the parameters updated at step $N-1$

 

$\therefore$ the model at step $(i+1)$ is trained with parameters updated using gradients from step $(i-1)$, which overlaps CPU compute with GPU compute (see the sketch below)
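A minimal sketch of this one-step delay, with a background thread standing in for the CPU optimizer running concurrently with the next GPU forward/backward pass (gpu_fwd_bwd, cpu_update, and copy_params_to_gpu are hypothetical callables, not part of the actual DeepSpeed API):

```python
from concurrent.futures import ThreadPoolExecutor

def train_with_dpu(num_steps, gpu_fwd_bwd, cpu_update, copy_params_to_gpu, warmup=40):
    """gpu_fwd_bwd(step) -> fp16 grads; cpu_update(grads) -> updated fp32 params."""
    pool, pending = ThreadPoolExecutor(max_workers=1), None
    for step in range(num_steps):
        grads = gpu_fwd_bwd(step)            # GPU step uses the currently installed parameters
        if pending is not None:
            # the CPU update launched from the previous step's gradients ran in
            # parallel with the GPU compute above; install its result now (one-step delay)
            copy_params_to_gpu(pending.result())
            pending = None
        if step < warmup:
            # first N-1 steps: no DPU, update synchronously while gradients change rapidly
            copy_params_to_gpu(cpu_update(grads))
        else:
            # DPU: kick off the CPU optimizer step asynchronously; the next GPU
            # forward/backward overlaps with it, still using the old parameters
            pending = pool.submit(cpu_update, grads)
    if pending is not None:                  # drain the last in-flight update
        copy_params_to_gpu(pending.result())
    pool.shutdown()
```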