[Link to Paper] ZeRO-Offload
PAPER SUMMARY
PROBLEM | Training billion-parameter models is limited by GPU memory: with mixed-precision Adam, the model states alone take 16 bytes per parameter, and existing scale-out approaches need many GPUs to cover that.
SOLUTION | ZeRO-Offload keeps the compute-intensive forward/backward on the GPU while offloading gradients, optimizer states, and the parameter update to CPU memory, using an optimized CPU Adam and one-step delayed parameter update to hide the CPU cost.
BACKGROUND
⭐️ How memory is consumed in large model training, existing methods for handling it efficiently, and how ZeRO works, as background.
Memory Consumption in Large model training
Memory consumption in DNN model training is classified into two parts;
- Model States : parameters, gradients, and optimizer states
- Residual States: activations, buffers, fragmented memory etc.
Among these two, Model States are the main bottleneck for large model training because;
- for each parameter, using mixed precision, two copies are kept: fp16 (2 bytes) and fp32 (4 bytes)
- the corresponding gradient for each parameter in fp16 (2 bytes)
- for each parameter, Adam optimizer keeps the momentum in fp32 (4 bytes) and variance in fp32 (4 bytes)
$\therefore$ for a model with $M$ parameters, it requires $16M$ bytes of memory
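A quick sanity check of the 16M figure above; the 10B-parameter model size below is just an illustrative choice, not a number from the paper.

```python
# Byte counts per parameter, as listed above (mixed-precision Adam).
BYTES_PER_PARAM = (
    2    # fp16 parameter copy
    + 4  # fp32 parameter copy
    + 2  # fp16 gradient
    + 4  # fp32 Adam momentum
    + 4  # fp32 Adam variance
)  # = 16 bytes

M = 10_000_000_000  # hypothetical 10B-parameter model
print(f"model states: {BYTES_PER_PARAM * M / 1e9:.0f} GB")  # -> 160 GB
```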
$\exists$ two categories when attempting to fit these model and residual states on GPUs;
- Scale-out training
- Scale-up training
Scale-out large model training
Uses aggregate memory of multiple GPUs to satisfy memory requirement
- Model Parallelism: partitions model vertically (intra-layer partitioning)
- Pipeline Parallelism: partitions model horizontally (inter-layer partitioning)
- ZeRO: splits the training batch across GPUs (similar to DP), partitions model states across GPUs instead of replicating them, and communicates to gather individual parameters when needed
However, all three require enough GPUs that their aggregate memory covers the memory requirement
ZeRO-Offload offloads model states to CPU memory instead
Scale-up large model training
How existing works scale up model size on a single GPU;
- Recomputing activations from checkpoints $\therefore$ no need to save all activations (a sketch of the first two techniques follows this list)
- Using low or mixed precision
- Using external memory such as CPU memory $\Leftarrow$ approach of ZeRO-Offload
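A minimal PyTorch sketch of the first two techniques (activation recomputation via `torch.utils.checkpoint` and mixed precision via `torch.cuda.amp`); the layer sizes and loss are hypothetical, and this is not code from the paper.

```python
import torch
from torch import nn
from torch.utils.checkpoint import checkpoint

# Hypothetical block; in practice this would be e.g. a transformer layer.
block = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(), nn.Linear(1024, 1024)).cuda()
x = torch.randn(8, 1024, device="cuda")

with torch.cuda.amp.autocast():                      # low/mixed precision
    y = checkpoint(block, x, use_reentrant=False)    # activations recomputed during backward
    loss = y.float().pow(2).mean()                   # placeholder loss
loss.backward()
```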
ZeRO powered data parallel training
Three Stages of ZeRO;
- ZeRO-1 : partition optimizer states
- ZeRO-2: partition optimizer states, gradients
- ZeRO-3: partition all model states
ZeRO-Offload works with ZeRO-2 (a minimal sketch of one training step follows this list);
- Each GPU stores a replica of all parameters, but updates only a mutually exclusive portion at the end of each training step $\Rightarrow$ it only stores the optimizer states and gradients needed for that portion
- Forward pass: each GPU computes the loss w.r.t. a different mini-batch
- Backward pass: gradients are computed at different points of the backward pass $\Rightarrow$ averaged using reduce
- After the backward pass, each GPU updates its portion of the parameters and optimizer states using the averaged gradients
- Perform all-gather so every GPU has the fully updated fp16 parameters
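A hedged sketch of this step using raw `torch.distributed` primitives; the flat-tensor chunking, the helper name `update_partition`, and the explicit reduce/all-gather loop are simplifying assumptions for illustration, not DeepSpeed's actual implementation.

```python
import torch
import torch.distributed as dist

def zero2_style_step(flat_params, flat_grads, update_partition):
    """flat_params / flat_grads: 1-D fp16 tensors replicated on every rank.
    update_partition(params, grads): owner-side optimizer step on one partition."""
    rank, world = dist.get_rank(), dist.get_world_size()
    p_chunks = list(flat_params.chunk(world))   # assume numel divisible by world
    g_chunks = list(flat_grads.chunk(world))

    # Average each gradient partition onto the rank that owns it.
    for owner in range(world):
        dist.reduce(g_chunks[owner], dst=owner, op=dist.ReduceOp.SUM)
    g_chunks[rank].div_(world)                  # only the owned partition is valid here

    # Each rank updates only its own parameter partition
    # (its optimizer states exist only for this partition).
    update_partition(p_chunks[rank], g_chunks[rank])

    # All-gather so every rank ends up with the fully updated parameters.
    dist.all_gather(p_chunks, p_chunks[rank])
```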
Offload Strategy
⭐️ How ZeRO-Offload handles a graph for offloading
- ZeRO-Offload uses a data-flow graph and partitions this graph for efficient offloading (specialized for mixed-precision training with the Adam optimizer)
DL Training as a data flow graph
- The offload strategy is represented as a partitioning of this data-flow graph (see the figure in the paper);
- computation nodes are executed, and data nodes are stored, on the device (CPU/GPU) that owns their partition
- Graph partitioning is done based on;
- CPU computation overhead
- Communication overhead
- Memory savings
1. Limiting CPU computation
- CPU computation throughput is much slower than GPU computation throughput $\Rightarrow$ avoid offloading compute-intensive components to CPU
- General complexity of DL training per iteration: $O(MB)$, where $M$ is the model size and $B$ is the batch size $\Rightarrow$ this should stay on the GPU
- norm calculations, weight updates (complexity $O(M)$) $\Rightarrow$ offloaded to CPU (rough numbers below)
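A back-of-the-envelope illustration of this split; both numbers below are hypothetical, chosen only to show the ratio.

```python
M = 1_000_000_000   # hypothetical parameter count
B = 512             # hypothetical effective batch size

fwd_bwd_work = M * B    # O(MB): stays on the GPU
update_work = M         # O(M): norms + Adam update, safe to offload to the CPU
print(f"fwd/bwd is ~{fwd_bwd_work // update_work}x the optimizer-update work")
```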
2. Minimizing Communication Volume
- PCI-E bandwidth between CPU and GPU $\ll$ CPU memory bandwidth $\ll$ GPU memory bandwidth
- $\therefore$ minimize communication between CPU and GPU!!
- Create an fp32 super-node: keep all fp32 model states in one partition (a node named Update Super)
- parameter16 is assigned to the FWD-BWD super-node on the GPU
3. Maximizing Memory savings
- maximum memory savings are achieved by offloading both gradient16 and Update Super to the CPU (back-of-the-envelope numbers below)
⭐️ GPU : FWD-BWD super-node, parameter16
⭐️ CPU : Update Super node (consists of the fp32 states), gradient16
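Rough numbers for this partition, reusing the per-parameter byte counts from the Background section; the split below is my reading of the strategy (fp16 parameters on GPU; fp16 gradients plus all fp32 states on CPU), and the 10B model size is hypothetical.

```python
M = 10_000_000_000  # hypothetical 10B-parameter model

gpu_bytes = 2 * M                    # parameter16 stays with FWD-BWD on the GPU
cpu_bytes = 2 * M + (4 + 4 + 4) * M  # gradient16 + Update Super (fp32 params, momentum, variance)
traffic = 2 * M + 2 * M              # per step: gradient16 GPU->CPU, updated parameter16 CPU->GPU

print(f"GPU model states : {gpu_bytes / 1e9:.0f} GB (vs {16 * M / 1e9:.0f} GB without offload)")
print(f"CPU model states : {cpu_bytes / 1e9:.0f} GB")
print(f"CPU<->GPU traffic: {traffic / 1e9:.0f} GB per training step")
```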
ZeRO-Offload Schedule
⭐️ Computation & communication schedule for implementing ZeRO-Offload on single GPU system, and further explanations on multi-GPUs
Single GPU Schedule
- Forward pass: compute loss. fp16 parameters are already on GPU, so no CPU communication is needed
- Backward pass: gradients for different parameters are computed at different points in the backward graph
- ZeRO-Offload transfers these gradients to CPU memory as they are produced (the transfer can overlap with backward propagation on the remaining backward graph) $\Rightarrow$ possible to hide the communication cost
- Update fp32 parameters and remaining optimizer states on CPU
- Copy the updated fp32 parameters from CPU to the fp16 parameters on GPU (a sketch of this schedule follows)
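A hedged PyTorch-style sketch of this single-GPU schedule; the helper names (`cpu_adam_step`, `cpu_fp32_params`, `cpu_state`) are placeholders, and the overlap of gradient transfers with the remaining backward graph (which the real implementation achieves with a separate stream) is omitted for brevity.

```python
import torch

def training_step(model, batch, cpu_fp32_params, cpu_state, cpu_adam_step):
    # Forward pass on GPU: fp16 parameters already live there, no CPU communication.
    loss = model(batch).float().mean()

    # Backward pass on GPU; the fp16 gradients are then copied to CPU memory.
    loss.backward()
    cpu_grads = [p.grad.detach().to("cpu") for p in model.parameters()]

    # Parameter update on CPU against the fp32 master copies and optimizer states.
    cpu_adam_step(cpu_fp32_params, cpu_grads, cpu_state)

    # Copy the updated fp32 parameters back into the fp16 parameters on GPU.
    with torch.no_grad():
        for p16, p32 in zip(model.parameters(), cpu_fp32_params):
            p16.copy_(p32.to(device=p16.device, dtype=p16.dtype))
```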
Scaling to Multi-GPUs
Gradients are calculated as in ZeRO-2, while each GPU offloads its partition of gradients, optimizer states, and parameter updates to the CPU.
Optimized CPU Execution
⭐️ How speedup of CPU execution is achieved for parameter updates
1. Implementing the CPU Optimizer
Faster implementation of the Adam optimizer using SIMD vector instructions, loop unrolling, and OpenMP multithreading (the update math is sketched below)
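The paper's CPU optimizer is hand-vectorized C++ (SIMD, loop unrolling, OpenMP); the NumPy sketch below only shows the fp32 Adam update math that kernel performs, with the fp16 parameter copy produced at the end for the GPU. It is an illustrative analogue, not the paper's implementation.

```python
import numpy as np

def cpu_adam_step(p32, g16, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """In-place Adam update on the fp32 master parameters; returns the fp16 copy."""
    g = g16.astype(np.float32)                    # fp16 gradients received from the GPU
    m[:] = b1 * m + (1 - b1) * g                  # momentum (fp32)
    v[:] = b2 * v + (1 - b2) * g * g              # variance (fp32)
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    p32 -= lr * m_hat / (np.sqrt(v_hat) + eps)    # fp32 parameter update
    return p32.astype(np.float16)                 # parameter16 to copy back to the GPU
```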
2. One-step Delayed Parameter Update
CPU computation may become the bottleneck when GPU compute time $\leq$ CPU compute time
$\Rightarrow$ one-step delayed parameter update (DPU) : overlap CPU and GPU compute
- First $N-1$ steps : trained without DPU to avoid destabilizing training in the early stages (gradients change rapidly)
- Step $N$ : obtain gradients from the GPU, skip the CPU optimizer step, and do not update the fp16 parameters on the GPU either
- Step $N+1$ : compute the parameter update on the CPU using the gradients from step $N$, while computing the forward and backward pass on the GPU in parallel using the parameters updated at step $N-1$
$\therefore$ the model at step $(i+1)$ is trained using parameters updated with gradients from step $(i-1)$, which overlaps CPU compute with GPU compute (sketch below)
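A minimal sketch of this overlap; `gpu_fwd_bwd`, `cpu_optimizer_step`, and `copy_params_to_gpu` are hypothetical placeholders, and a background thread stands in for the asynchronous CPU update.

```python
import threading

def dpu_training_loop(num_steps, gpu_fwd_bwd, cpu_optimizer_step, copy_params_to_gpu, warmup=40):
    prev_grads = None                       # gradients produced at the previous step

    for step in range(num_steps):
        if step < warmup:                   # first N-1 steps: no delay
            grads = gpu_fwd_bwd(step)
            cpu_optimizer_step(grads)
            copy_params_to_gpu()
            continue

        # Launch the CPU update for last step's gradients (skipped at step N,
        # where prev_grads is still None) ...
        pending = None
        if prev_grads is not None:
            pending = threading.Thread(target=cpu_optimizer_step, args=(prev_grads,))
            pending.start()

        # ... while the GPU runs forward/backward with the parameters it
        # currently holds, i.e. the ones updated from gradients two steps back.
        grads = gpu_fwd_bwd(step)

        if pending is not None:
            pending.join()
            copy_params_to_gpu()            # fp32 -> fp16 copy for the next step

        prev_grads = grads
```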